Comparative Analysis and Classification of Cassette Exons

Research ArticleComparative Analysis and Classification of Cassette Exons andConstitutive Exons

Ying Cui12 Meng Cai23 and H Eugene Stanley2

1School of Mechano-Electronic Engineering Xidian University Xirsquoan 710071 China2Center for Polymer Studies and Department of Physics Boston University Boston MA 02215 USA3School of Economics and Management Xidian University Xirsquoan 710071 China

Correspondence should be addressed to Meng Cai mcaibuedu

Received 18 August 2017 Accepted 13 November 2017 Published 4 December 2017

Academic Editor Yudong Cai

Copyright copy 2017 Ying Cui et alThis is an open access article distributed under the Creative Commons Attribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Alternative splicing (AS) is a major engine that drives proteome diversity in mammalian genomes and is a widespread cause ofhuman hereditary diseases More than 95 of genes in the human genome are alternatively spliced and the most common type ofAS is the cassette exon Recent discoveries have demonstrated that the cassette exon plays an important role in genetic diseases Todiscover the formation mechanism of cassette exon events we statistically analyze cassette exons and find that cassette exon eventsare strongly influenced by individual exons that are smaller in size and that have a lower GC content more codon terminations andweaker splice sitesWe propose an improved random-forest-based hybridmethod of distinguishing cassette exons from constitutiveexons Our method achieves a high accuracy in classifying cassette exons and constitutive exons and is verified to outperformprevious approaches It is anticipated that this study will facilitate a better understanding of the underlying mechanisms in cassetteexons

1 Introduction

In most eukaryotic organisms gene expression is regulatedby alternative splicing (AS) which is themajor posttranscrip-tional mechanism that promotes biological complexity ASis a biological process in which one gene produces a varietyof transcript isoforms through the distinct selection of splicesites during pre-mRNA splicing AS plays an important reg-ulatory role in proteome diversity [1 2] Up to 95 of humangenes are alternatively spliced in order to encode proteinswith different functions [3] and approximately 15 of humanhereditary diseases and cancers are caused by AS [4]

Cassette exon splicing also known as exon skipping isthe most prevalent form of alternative splicing in the humangenome and accounts for 50 to 60 percent of all alternativelyspliced events [5] A cassette exon is a splicing event inwhich an intervening exon between two other exons from themature mRNA sequence can be either included or skippedin order to generate two distinct protein isoforms A numberof recent discoveries have concluded that the cassette exonis closely associated with a broad range of human diseases

[6ndash11] such as renal cancer [9] neuromuscular diseases[10] and hearing loss [11] Cassette exons have also beenemployed as a therapeutic strategy for producing the requiredproteins for genetic diseases [10 12] such as the congenitalmyasthenia syndrome [12] Despite its importance becauseof its complexity the cassette exon mechanism is not fullyunderstood and because of the limited availability of accu-rate computational tools genome-wide detection of cassetteexons remains amajor challengeThus we focus our attentionhere on a comparative analysis of the sequence features ofcassette exons and constitutive exons in order to find theregulatory factors that contribute to cassette exon eventsWe also aim to construct a classification model that candistinguish between cassette exons and constitutive exons

Various machine learning approaches are used in cassetteexons identification [13ndash16] Dror et al used support vectormachine (SVM) which is one of the classical machine learn-ing methods to classify cassette exons and constitutive exonsconserved between human and mouse [13] This research ishighly dependent on conservation-based features and thusits application on cassette exons detection is constrained

HindawiBioMed Research InternationalVolume 2017 Article ID 7323508 8 pageshttpsdoiorg10115520177323508

2 BioMed Research International

to exons conserved between human and mouse genomesSu et al also proposed an SVM-based approach whichuses two classifiers [14] The first classifier distinguishesauthentic exons from pseudo exons and the second classifierdistinguishes cryptic exons from constitutive and skippedexons This method did not use conservation informationand therefore can be used more widely and easily Howeverit only can achieve about 70 accuracy Li et al introducedan AdaBoost-based method to identify cassette exons andachieved 7783 accuracy [15] Recently amethod combininga gene expression programming (GEP) algorithm and arandom forest (RF) model was proposed for identifyingcassette exons in the human genome [16] and achieved asignificantly higher accuracy (9087) than previous studiesAlthough this hybrid program is a contribution that allowsthe investigation of cassette exons in human genes its resultsstill need improving This GEP+RF method [16] also cannotspeculate on the regulators of the formation mechanism ofcassette exons events

To understand why cassette exons occur and what kindof exon is more likely to be skipped we here analyze andcompare the sequence features of cassette exons with theconstitutive exons in the human genome We find thatexons with shorter length weaker splice sites lower GCcontent and more termination codons are more likely to beskipped during the splicing process We propose a hybridclassification method based on an improved random forestmodel for distinguishing cassette exons in the human genomefrom constitutive exons Computational simulation resultsindicate that our approach more accurately identifies cassetteexons than previous methods

2 Materials and Methods

21 Datasets Weuse theHEXEvent database which providesa list of human internal exons and reports all known spliceevents as our source of alternative splicing information[17] We downloaded the complete list of cassette exonsand constitutive exons of the first human chromosome fromthe HEXEvent database We also downloaded sequencesof the first human chromosome from the UCSC GenomeBrowser [18] and extracted cassette exon and constitutiveexon sequences using the lists we obtained from HEXEventOnly exons without any other kind of AS events supportedby the ESTs were used for analysis and this produced a finaldataset of 3120 cassette exons and 7316 constitutively splicedexons

22 Feature Extraction and Analysis Here we compute andcomparatively analyze the sequence features of cassette exonsand constitutive exons to find out how they differ and toextract classification features for use in the next step Wecompute 91 features including length nucleotide composition(mononucleotide dinucleotide and trinucleotide) GC con-tent termination codon frequency and splice site strengthand we use the 119905-test to analyze the differences between thetwo groups of data

Although splice site strength signal is typically modeledusing methods such as the first-order Markov model the

weight matrix model and the maximum entropy (MaxEnt)model [19] the first-order Markov model disregards connec-tions between nonadjacent positions and the weight matrixmodel is hindered by the hypothesis that different positionsare independent We thus use MaxEnt model to determinethe strength of the splice site signal

The MaxEnt model obtains the most likely distribu-tion function of observables from statistical systems bymaximizing the entropy under a set of constraints thatmodel the incomplete information about the target distri-bution The MaxEnt model consists of two distributionsthe scilicet signal model (119875+(119909)) and the decoy probabilitydistribution (119875minus(119909))

Let 119901 denote the unknown probability distribution For aspecific DNA sequence x 119901(119909) represents its probability Let119901 be the approximation of119901 where the entropy of119901 is definedas

119867(119901) = minussum119901 (119909) log2 (119901 (119909)) (1)

For a given sequence x the signal strength of 119909 is

119871 (119883 = 119909) = 119901+ (119883 = 119909)119901minus (119883 = 119909) (2)

Here 119901+(119883 = 119909) and 119901minus(119883 = 119909) are the probability of 119909 fromthe distribution of signals (+) and decoys (minus) respectively

Table 1 and Figure 1 show the feature analysis results Wefind that cassette exons are usually shorter in size and havea lower GC content more termination codons and weakersplice sites than constitutive exons The strength of a splicesite determines the skipping level in AS so weak splice sitesare suboptimal for the splicing mechanism Apparently shortlength abundance of terminal codons and weak splice signalhinder the ability of the splicing mechanism to recognizethese exons and the result is exon skipping

23 Classifiers We employ five binary classifiers to distin-guish cassette exons from constitutive exons based on the 91extracted features

231 KNN Classifier The 119896-nearest neighbors (KNN) algo-rithm is a linear pattern recognition method used forclassification and regression [20] For classification KNNcompares a test tuple and a collection of labeled examplesin a training set Each new sample in the prediction set isclassified according to the class of themajority of its 119896-nearestneighbors in the training set Parameter ldquo119870rdquo is the numberof neighbors used to determine the class and it stronglyinfluences the identification rate of the KNN model

232 SVM Classifier The support vector machine (SVM) isa supervised machine learning algorithm based on statisticallearning theory [21] SVM is primarily used to solve classi-fication and regression problems and has been successfullyapplied to bioinformatics tasks such as alternative splice sitesidentification and diagnostic method of diabetes [22 23]SVM uses a nonlinear mapping function to map originaldata into a high-dimensional feature space It then constructsa hyperplane to be the surface that discriminates betweenpositive and negative data

BioMed Research International 3

Table 1 Results of sequence feature analysis

Feature Mean value of cassette exons Mean value of constitutive exons 119901 valueLength 14285 17639 lt22e minus 16GC content 04975 05062 0009886Termination codon 00128 00117 0000600551015840 splice strength minus1343874 minus121145 0000528531015840 splice strength minus1991228 minus1829848 00002665

CA G T0

002

004

006

008

01

AA AG CA CG GA GG TA TG

AA

AA

AG ACA

ACG

AGA

AGG

ATA

ATG

CAA

CAG

CCA

CCG

CGA

CGG

CTA

CTG

GA

AG

AGG

CAG

CGG

GA

GG

GG

TAG

TG TAA

TCA

TAG

TCG

TGA

TGG

TTA

TTG

Cassette exonsConstitutive exons

0

005

01

015

02

025

03

Mea

n oc

curr

ence

0

0005

001

0015

002

0025

003

Mea

n oc

curr

ence

Figure 1 Mean occurrence of mononucleotide dinucleotide and trinucleotide

233 RF Classifier The random forest (RF) algorithm is anensemble machine learning method developed by Breiman[24] It has been widely applied to prediction problems inbioinformatics [25ndash27] The RF algorithm consists of multi-base tree-structured classifiers such as CART (classificationand regression tree) and it is robust to noise is not hinderedby overfitting and is computationally feasible By applyingCART as a base classifier RF collects the outputs of alldecision trees tallies the result and uses the result to classifythe sample

234 CF Classifier The majority-voting rule in the tradi-tional RF algorithm can cause minority categories to bemisclassified Thus an improved RF function the CForest(CF) function has been proposed [28] Unlike the standardRF algorithm based on the CART with its unfair splittingcriterion the CForest function uses a conditional inference

framework to create an unbiased tree base classificationmodel

In the CForest algorithm conditional inference treesare used to evaluate the association between the predictorvariable and the response The CForest algorithm works asfollows (i) The algorithm tests the global null hypothesis ofindependence between any of the input variables and theresponse and continues until this hypothesis is accepted Ifit is not the input variable with the strongest connection tothe response is selected (ii) The selected predictor variable issplit into two disjoint sets (iii) Steps (i) and (ii) are repeated

The most important part of a forest framework is thesplitting objective function In the traditional RF the mostcommon measures are the information gain and the Giniimpurity which are biased towards relevant predictor vari-ables In response Strobl et al [29] proposed adding aconditional permutation importance scheme to the CForest


framework This permutation importance scheme uses afitted forest model to partition the entire feature space andit can be used to demonstrate the influence of a variable andto compute the importance of its permutation conditional onall types of correlated covariates

The CForest framework provides unbiased variableimportancemeasurements and uses conditional permutationfor feature selection Here the importance of a predictorvariable is measured by comparing the accuracy of theprediction before and after the permutation

Let 119861(119905) represent the out-of-bag (oob) sample for a tree twith 119905 isin 1 119899119905119903119890119890 Here ntree is the number of trees inthe forest The oob-prediction accuracy in tree 119905 before thepermutation is

sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(3)

where 119910(119905)119894 = 119891(119905)(119909119894) is the predicted class for observation 119894before permutation

After permuting the value of119883119895 the new accuracy is

sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(4)

where 119885 is the remaining predictor variables 119885 = 1198831 119883119895minus1 119883119895+1

Then the variable importance of 119883119895 in tree 119905 can beexpressed

VI(119905) (119883119895) =sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

minussum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(5)

Finally the importance of each variable 119883119895 for the forest iscalculated as an average over all trees

VI (119883119895) =sum119899119905119903119890119890119905=1 VI(119905) (119883119895)

119899119905119903119890119890 (6)

235 XGBoost Classifier EXtreme Gradient Boosting(XGBoost) [30] is a scalable machine learning system for treeboosting and is one of the most popular machine learningmethods in recent years Based on the principle of gradientboosting machines proposed by Friedman [31] XGBoostproduces a boosting ensemble of weak classification treesby optimizing a differentiable loss function with gradientdescent algorithm XGBoost deals with overfitting and iscomputationally efficient

24 Performance Assessment The 119896-fold cross-validationtechnique is commonly used to measure the performance of

a classifier as it overcomes the problem of overfitting The 119896-fold cross-validation method does not use the entire datasetto train the model but randomly splits the data into 119896 equal-sized smaller subsets Of the 119896 subsets a single subset isretained to be the validation data for testing the model andthe remaining 119896 minus 1 subsets are used as training data Thecross-validation process is then repeated 119896 times (the folds)with each of the 119896 subsets used once as test data to assessthe performance of the model We can then use the averageof 119896 results from the fold to assess the performance of theconstructed model

A receiver operating characteristic (ROC) curve is aplot of the performance of a binary classifier system thatshows the true positive rate against the false positive rate fordifferent cut points We often use this plot to calculate thearea under a curve (AUC) to measure the performance of aclassification model The larger the area under the curve (aperfect prediction will make AUC equal to 1) the better thepredictive accuracy of the model

The evaluation indicators we use tomeasure performanceare sensitivity specificity and total accuracy

119878119899 =TP

TP + FN

119878119901 =TN

TN + FP

TA = TP + TNTP + TN + FN + FP

(7)

where TP is the number of correctly recognized positivesFN is the number of positives recognized as negatives FPsignifies the number of negatives recognized as positives andTN is the number of correctly recognized negatives Positivesare cassette exons and negatives are constitutive exons

3 Results

31 Sample Preparation The 91 extracted sequence features(listed in Table 2) are combined in tandem to create aninput classifier vector To eliminate the negative effects ofmagnitude difference in value between different features wescale the feature values to a range of minus1 and 1 Cassette exonsand constitutive exons are considered to be positive andnegative samples respectively In the 10-fold cross-validationthe whole dataset is randomly and equally divided into tensets Each set is used as a testing set and the remaining nineare used for training We use a total of ten testing sets andeach training set is nine times the size of its correspondingtesting set

32 Performance of Different Classifiers To highlight thegood performance in discrimination of cassette exonsand constitutive exons according to sequence features weattempted to compare the classification results from KNNSVM RF CF and XGBoost approaches in this work In theconduction of classifier model parameter tuning is essen-tial for building a binary-class model with high predictiveaccuracy and stability In each classifier parameters wereoptimized according to the prediction performance they


Table 2 List of extracted features

Feature subset Number of featuresLength 1Mononucleotide 4Dinucleotide 16Trinucleotide 64Termination codon 3GC content 1Splice site strength 2Total 91

eventually achieve A parameter will be settled when itreaches the best prediction performance

In the KNN model we search the optimal parameter 119870value by using a 10-fold cross-validation We estimate thepredictive errors for a given set of 119870 values using cross-validation and select the 119870 value that allows the subsequentKNN classification to yield optimal resultsThus we optimize20 K values (119870 = 1 2 20) using the 10-fold cross-validationThe KNN classifier achieves optimal performancewhen 119870 = 7

In the SVM model we examine the radial basis func-tion (RBF) kernel the linear kernel the polynomial kerneland the sigmoid kernel Each kernel function has severalparameters and properly tuning them significantly affectsthe classification performance of the SVM model We use agrid-search technique and 10-fold cross-validation to searchthe optimal parameter values of SVM using the four kernelsTable 3 supplies the predictive performance results of the fourkernels in the SVMmodel Of the four the RBF kernel showsthe best predictive accuracy and we choose it to be the basickernel function of the SVMclassifierThere are two importantparameters associated with RBF kernels 119862 and 119892 We use agrid search and 10-fold cross-validation and find the optimal(119862 119892) pair to be (25 0000012)

In the random forest model two parameters play animportant rolemdashntree and mtry The ntree parameter is thenumber of trees in the forest and the mtry parameter isthe number of variables available for splitting at each treenode The forest error rate depends on two things (i) thecorrelation between any two trees in the forest (increasingthe correlation increases the error rate) and (ii) the strengthof each individual tree in the forest (a tree with a low errorrate is a strong classifier and increasing the strength of theindividual trees decreases the forest error rate) A largernumber of trees produce a more stable model and covariateimportance estimates but require more memory and a longerrun time Reducingmtry reduces both the correlation and thestrength Increasing it increases both Somewhere in betweenis an ldquooptimalrdquo mtry range We use a grid search and 10-foldcross-validation to find the optimal ntree and mtry We findthat the RF classifier performs best when ntree = 1000 andmtry = 91

CForest parameter optimization is similar to the randomforest A grid search with 10-fold cross-validation is used to

KNNSVMRF

CFXGBoost

00

02

04

06

08

10

Sens

itivi

ty

08 06 04 02 0010Specificity

Figure 2 ROC curves of different classifiers

determine the best ntree and mtry The optimal (ntree mtry)pair is (1050 10)

In the XGBoost model the type of model is set as tree-based models We mainly tuned two parameters eta andmax depth The eta is analogous to learning rate which isthe step size shrinkage used when getting the weights of newfeatures after each boosting step to prevent overfitting Themax depth is the maximum depth of a tree which is used tocontrol overfitting as higher depthwill make themodelmorecomplex We use a grid search and 10-fold cross-validationto tune these two parameters and the optimal values are eta= 05 and max depth = 4 Table 4 displays all the parameteroptimization details of the KNN SVM RF CF and XGBoostmodels

Using the optimal parameters we construct five classifiermodels for identifying cassette exons based on KNN SVMRF CF and XGBoost models We use sensitivity speci-ficity total accuracy and AUC to compare their predictiveperformances (shown in Table 5) and Figure 2 shows theROC curves of different classifiers represented by differentcolors It can be observed that CForest classifier significantlyoutperforms other classifiers As another tree-based modelXGBoost wins the second place We can see from Table 5that there is an obvious imbalance between sensitivity andspecificity in the RF classifier (Sn = 7149 Sp = 7872)where Sn is notably lower than Sp indicating that RF classifiercan distinguish true constitutive exons from cassette exonsbut is not so effective in recognizing true cassette exonsThis phenomenon can also be seen in KNN classifier (Sn =6703 Sp = 7557) another majority-voting principlebased model In contrast CF classifier can do both Sp isonly 172 percent higher than Sn This is the case because


Table 3 Classification results of SVM classifier with different kernels

Kernel Parameters Sensitivity () Specificity () TA ()119862 119892 119889 119903Linear 1 7023 6857 6927RBF 25 0000012 7323 7034 7169Poly 14 0015623 3 7226 7012 7108Sigmoid 357 0000564 03 7327 6978 7144

Table 4 Parameter optimization details in different classifiers

Classifier Parameters Step size in search Search range Optimal valueKNN 119870 1 1 20 7

SVM 119862 1 1 500 25119892 0000001 10minus6 1 0000012

Random forest ntree 50 50 2000 1000mtry 1 1 91 91

CForest ntree 50 50 2000 1050mtry 1 1 91 10

XGBoost eta 01 01 1 05max depth 1 1 10 4

Table 5 Classification performance of different classifiers

Classifier Sensitivity () Specificity () TA () AUCKNN 6703 7557 7045 07818SVM 7323 7034 7169 07865Random forest 7149 7872 7459 08411CForest 9590 9762 9669 09954XGBoost 8355 8537 8444 09270

CForest classifier provides an unbiased measure of variableimportance

Generally the nonlinear models (SVM RF CF andXGBoost) are superior to the linear model (KNN) andthe CForest model is the best Theoretically the nonlinearmethod performs better than the linearmethodwhen appliedto self-learning and self-adjustingThe nonlinearmodel has asimpler structure and a higher classification performance Ina forest-based model the splitting rule plays an essential rolein the prediction accuracy Traditional RF uses the majority-voting rule to split which shows a bias towards relevantpredictor variables This unfair splitting criterion tends tomake the minority categories more likely to be misclassifiedIn response to this limitation CForest is established by unbi-ased base classification trees based on a conditional inferenceframework The CForest model assembles unbiased baseclassification trees into a conditional inference frameworkand thus eliminates the errors caused by the majority-votingrule used in the KNN RF and XGBoost models In ourcase there is a notable imbalance of the amount betweencassette exons and constitutive exons Thus the CForestmodel is theoretically superior and exhibits a better predictiveperformance than the other classifiers

33 Feature Importance The CForest model provides anunbiased measure of variable importance which can used to

evaluate the importance of the features applied in the classi-fication We calculate the importance scores of all sequencefeatures and rank the features according to their importancescores to explore the most essential features that predict cas-sette exons in ourmodel Figure 3(a) shows the top 15 featuresranked according to their importance scores generated by theCForest classifier The rank of a feature indicates its impor-tance when discriminating cassette exons from constitutiveexonsWe can observe fromFigure 3(a) that in the top 15 fea-tures 12 of them are dinucleotides and trinucleotides More-over 8 of the 12 motifs (ldquogacrdquo ldquoaggrdquo ldquoacgrdquo ldquotagrdquo ldquogardquo ldquocgardquoldquoagardquo and ldquogaardquo) contain ldquoagrdquo or ldquogardquoThis indicates that ldquoagrdquoand ldquogardquo play a role in the occurrence of cassette exonevents Additionally it is speculated that length 51015840 splicesite strength and 31015840 splice site strength are the regulators ofcassette exon events in the human genome In Figure 3(b) wedisplay the top 15 features ranked according to their impor-tance scores generated in the random forest classificationmodel Random forest provides two measures to evaluatefeature importance here we use the most common oneknown as Gini impurity It can be seen from Figure 3(b) thatin the RF classifier the most important feature is the lengthwhich is the second important feature in CF classifier Similarto CF 6 of the top 15 features motifs (ldquoagrdquo ldquoggardquo ldquogacrdquoldquoagardquo ldquogaardquo and ldquogardquo) contain ldquoagrdquo or ldquogardquo Compared to CFthe difference of importance score in RF among the top 15features is obviously smaller This indicates that CF classifiercan better reveal the distinct importance of different featuresin the process of classification which might be a reason whyCF exceeds RF in the classification of cassette exons andconstitutive exons

34 Comparison with Existing Methods To demonstratethe superiority of our work we compare the performance(sensitivity specificity and total accuracy) of our methodto the methods proposed in [13ndash16] Figure 4 shows the


gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)

Figure 3 Importance rank of features (top 15) in different classification models (a) CForest (b) random forest

[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)

Figure 4 Performance comparison between existing methods andour method

classification accuracy of different methods measured by SnSp and TA Our method achieved 9590 percent sensitivityand 9762 percent specificity a higher level of accuracy thanthe other methods

4 Discussion

In this paper we used a comparative sequence feature anal-ysis that includes length nucleotide composition (mononu-cleotide dinucleotide and trinucleotide) GC content ter-mination codon frequency and splice site strength to dis-tinguish between cassette exons and constitutive exons Theresults clearly indicate that cassette exons the most commonAS form of human have shorter introns a lower GC contentmore termination codons and weaker splice sites thanconstitutive exons

These sequence features are combined in tandem to serveas an input of classifierWe attempted five different classifiersthat is KNN SVM random forest CForest and XGBoostto complement the discrimination task We use grid searchand 10-fold cross-validation to find the optimal parametersfor every classification model trying to make each of themachieve its best performance Finally the CForest classifieroutperforms the other four models and reaches a totalaccuracy of 9669 With the unbiased variable importancemeasure supplied in the CForest model we ranked theimportance of all features and displayed the top 15 of themwhich is supposed to explore the regulator and contributorof cassette exon events in the human genome In additionwe presented a comparison of the classification accuracybetween our method and other existing methods to validatethe superiority of our method in identifying cassette exonsIt is obvious that our work provides an improved method forhuman cassette exons identification

Additional Points

Data Availability Statement Data used in this paper can bedownloaded from httphexeventmmgucieducgi-binHEX-EventHEXEventWEBcgi and httphgdownloadsoeucscedudownloadshtml and the source code is available athttpssourceforgenetprojectscassette-exon-predictionfiles

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by grants from the Natural ScienceFoundation of Shaanxi Province of China (no 2016JQ6072)the National Natural Science Foundation of China (no71501153) Xirsquoan Science and Technology Program (noXDKD001) and China Scholarship Council (201506965039201606965057) The Boston University Center for Polymer


Studies is supported by NSF Grants PHY-1505000 CMMI-1125290 and CHE-1213217 by DTRA Grant HDTRA1-14-1-0017 and by DOE Contract DE-AC07-05Id14517

References

[1] E Petrillo M A Godoy Herz A Barta M Kalyna and A RKornblihtt ldquoLet there be light regulation of gene expression inplantsrdquo RNA Biology vol 11 no 10 pp 1215ndash1220 2014

[2] K D Raczynska A Stepien D Kierzkowski et al ldquoThe SER-RATE protein is involved in alternative splicing in Arabidopsisthalianardquo Nucleic Acids Research vol 42 no 2 pp 1224ndash12442014

[3] Q Pan O Shai L J Lee B J Frey and B J Blencowe ldquoErratumDeep surveying of alternative splicing complexity in the humantranscriptome by high-throughput sequencing (Nature Genet-ics (2008) 40 (1413-1415))rdquoNature Genetics vol 41 no 6 p 7622009

[4] Y Marquez J W S Brown C G Simpson A Barta and MKalyna ldquoTranscriptome survey reveals increased complexityof the alternative splicing landscape in Arabidopsisrdquo GenomeResearch vol 22 no 6 pp 1184ndash1195 2012

[5] H Dvinge and R K Bradley ldquoWidespread intron retentiondiversifies most cancer transcriptomesrdquo Genome Medicine vol7 no 1 article no 45 2015

[6] S Oltean and D O Bates ldquoHallmarks of alternative splicing incancerrdquo Oncogene vol 33 no 46 pp 5311ndash5318 2014

[7] M Danan-Gotthold R Golan-Gerstl E Eisenberg K Meir RKarni and E Y Levanon ldquoIdentification of recurrent regulatedalternative splicing events across human solid tumorsrdquo NucleicAcids Research vol 43 no 10 pp 5130ndash5144 2015

[8] S C-W Lee and O Abdel-Wahab ldquoTherapeutic targeting ofsplicing in cancerrdquo Nature Medicine vol 22 no 9 pp 976ndash9862016

[9] Y Christinat R Pawłowski and W Krek ldquoJSplice A high-per-formance method for accurate prediction of alternative splicingevents and its application to large-scale renal cancer transcrip-tome datardquo Bioinformatics vol 32 no 14 pp 2111ndash2119 2016

[10] E M McNally and E J Wyatt ldquoWelcome to the splice ageAntisense oligonucleotide-mediated exon skipping gains widerapplicabilityrdquo The Journal of Clinical Investigation vol 126 no4 pp 1236ndash1238 2016

[11] Y Kim H R Kim J Kim et al ldquoA novel synonymous mutationcausing complete skipping of exon 16 in the SLC26A4 gene ina Korean family with hearing lossrdquo Biochemical and BiophysicalResearch Communications vol 430 no 3 pp 1147ndash1150 2013

[12] S Tei H T Ishii H Mitsuhashi and S Ishiura ldquoAntisenseoligonucleotide-mediated exon skipping of CHRNA1 pre-mRNA as potential therapy for Congenital Myasthenic Syn-dromesrdquo Biochemical and Biophysical Research Communica-tions vol 461 no 3 pp 481ndash486 2015

[13] G Dror R Sorek and R Shamir ldquoAccurate identificationof alternatively spliced exons using support vector machinerdquoBioinformatics vol 21 no 7 pp 897ndash901 2005

[14] G Su Y-F Sun and J Li ldquoThe identification of human crypticexons based on SVMrdquo in Proceedings of the 3rd InternationalConference on Bioinformatics and Biomedical EngineeringiCBBE 2009 China June 2009

[15] L Li Q Peng A Shakoor T Zhong S Sun and X WangldquoA classification of alternatively spliced cassette exons usingAdaBoost-based algorithmrdquo in Proceedings of the 2014 IEEE

International Conference on Information and Automation ICIA2014 pp 370ndash375 China July 2014

[16] X Zhang Q Peng L Li and X Li ldquoRecognition of alternativelyspliced cassette exons based on a hybrid modelrdquo Biochemicaland Biophysical Research Communications vol 471 no 3 pp368ndash372 2016

[17] A Busch and K J Hertel ldquoHEXEvent A database of humanEXon splicing Eventsrdquo Nucleic Acids Research vol 41 no 1 ppD118ndashD124 2013

[18] L R Meyer A S Zweig A S Hinrichs et al ldquoThe UCSCGenome Browser Database update 2006rdquo Nucleic AcidsResearch vol 34 no 90001 pp D590ndashD598 2006

[19] G Yeo and C B Burge ldquoMaximum entropy modeling of shortsequence motifs with applications to RNA splicing signalsrdquoJournal of Computational Biology vol 11 no 2-3 pp 377ndash3942004

[20] L E Peterson ldquoK-nearest neighborrdquo Scholarpedia vol 4 no 2article 1883 2009

[21] VN VapnikTheNature of Statistical LearningTheory Springer2000

[22] Y Cui J Han D Zhong and R Liu ldquoA novel computationalmethod for the identification of plant alternative splice sitesrdquoBiochemical and Biophysical Research Communications vol 431no 2 pp 221ndash224 2013

[23] J Zhang J Xu X Hu et al ldquoDiagnostic method of diabetesbased on support vector machine and tongue imagesrdquo BioMedResearch International vol 2017 Article ID 7961494 9 pages2017

[24] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[25] H Kazan ldquoModeling Gene Regulation in Liver HepatocellularCarcinoma with Random Forestsrdquo BioMed Research Interna-tional vol 2016 Article ID 1035945 2016

[26] X-Y Pan Y-N Zhang and H-B Shen ldquoLarge-scale predic-tion of human protein-protein interactions from amino acidsequence based on latent topic featuresrdquo Journal of ProteomeResearch vol 9 no 10 pp 4992ndash5001 2010

[27] X Pan L Zhu Y-X Fan and J Yan ldquoPredicting protein-RNA interaction amino acids using random forest based onsubmodularity subset selectionrdquo Computational Biology andChemistry vol 53 pp 324ndash330 2014

[28] T Hothorn K Hornik and A Zeileis ldquoUnbiased recursivepartitioning a conditional inference frameworkrdquo Journal ofComputational and Graphical Statistics vol 15 no 3 pp 651ndash674 2006

[29] C Strobl A-L Boulesteix T Kneib T Augustin and A ZeileisldquoConditional variable importance for random forestsrdquo BMCBioinformatics vol 9 no 1 article 307 2008

[30] T Chen and C Guestrin ldquoXGBoost a scalable tree boostingsystemrdquo in Proceedings of the 22nd ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining KDD2016 pp 785ndash794 August 2016

[31] J H Friedman ldquoGreedy function approximation a gradientboosting machinerdquo The Annals of Statistics vol 29 no 5 pp1189ndash1232 2001

Submit your manuscripts athttpswwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 201


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research



Microbiology


to exons conserved between human and mouse genomesSu et al also proposed an SVM-based approach whichuses two classifiers [14] The first classifier distinguishesauthentic exons from pseudo exons and the second classifierdistinguishes cryptic exons from constitutive and skippedexons This method did not use conservation informationand therefore can be used more widely and easily Howeverit only can achieve about 70 accuracy Li et al introducedan AdaBoost-based method to identify cassette exons andachieved 7783 accuracy [15] Recently amethod combininga gene expression programming (GEP) algorithm and arandom forest (RF) model was proposed for identifyingcassette exons in the human genome [16] and achieved asignificantly higher accuracy (9087) than previous studiesAlthough this hybrid program is a contribution that allowsthe investigation of cassette exons in human genes its resultsstill need improving This GEP+RF method [16] also cannotspeculate on the regulators of the formation mechanism ofcassette exons events

To understand why cassette exons occur and what kindof exon is more likely to be skipped we here analyze andcompare the sequence features of cassette exons with theconstitutive exons in the human genome We find thatexons with shorter length weaker splice sites lower GCcontent and more termination codons are more likely to beskipped during the splicing process We propose a hybridclassification method based on an improved random forestmodel for distinguishing cassette exons in the human genomefrom constitutive exons Computational simulation resultsindicate that our approach more accurately identifies cassetteexons than previous methods

2 Materials and Methods

21 Datasets Weuse theHEXEvent database which providesa list of human internal exons and reports all known spliceevents as our source of alternative splicing information[17] We downloaded the complete list of cassette exonsand constitutive exons of the first human chromosome fromthe HEXEvent database We also downloaded sequencesof the first human chromosome from the UCSC GenomeBrowser [18] and extracted cassette exon and constitutiveexon sequences using the lists we obtained from HEXEventOnly exons without any other kind of AS events supportedby the ESTs were used for analysis and this produced a finaldataset of 3120 cassette exons and 7316 constitutively splicedexons

22 Feature Extraction and Analysis Here we compute andcomparatively analyze the sequence features of cassette exonsand constitutive exons to find out how they differ and toextract classification features for use in the next step Wecompute 91 features including length nucleotide composition(mononucleotide dinucleotide and trinucleotide) GC con-tent termination codon frequency and splice site strengthand we use the 119905-test to analyze the differences between thetwo groups of data

Although splice site strength signal is typically modeledusing methods such as the first-order Markov model the

weight matrix model and the maximum entropy (MaxEnt)model [19] the first-order Markov model disregards connec-tions between nonadjacent positions and the weight matrixmodel is hindered by the hypothesis that different positionsare independent We thus use MaxEnt model to determinethe strength of the splice site signal

The MaxEnt model obtains the most likely distribu-tion function of observables from statistical systems bymaximizing the entropy under a set of constraints thatmodel the incomplete information about the target distri-bution The MaxEnt model consists of two distributionsthe scilicet signal model (119875+(119909)) and the decoy probabilitydistribution (119875minus(119909))

Let 119901 denote the unknown probability distribution For aspecific DNA sequence x 119901(119909) represents its probability Let119901 be the approximation of119901 where the entropy of119901 is definedas

119867(119901) = minussum119901 (119909) log2 (119901 (119909)) (1)

For a given sequence x the signal strength of 119909 is

119871 (119883 = 119909) = 119901+ (119883 = 119909)119901minus (119883 = 119909) (2)

Here 119901+(119883 = 119909) and 119901minus(119883 = 119909) are the probability of 119909 fromthe distribution of signals (+) and decoys (minus) respectively

Table 1 and Figure 1 show the feature analysis results Wefind that cassette exons are usually shorter in size and havea lower GC content more termination codons and weakersplice sites than constitutive exons The strength of a splicesite determines the skipping level in AS so weak splice sitesare suboptimal for the splicing mechanism Apparently shortlength abundance of terminal codons and weak splice signalhinder the ability of the splicing mechanism to recognizethese exons and the result is exon skipping

23 Classifiers We employ five binary classifiers to distin-guish cassette exons from constitutive exons based on the 91extracted features

231 KNN Classifier The 119896-nearest neighbors (KNN) algo-rithm is a linear pattern recognition method used forclassification and regression [20] For classification KNNcompares a test tuple and a collection of labeled examplesin a training set Each new sample in the prediction set isclassified according to the class of themajority of its 119896-nearestneighbors in the training set Parameter ldquo119870rdquo is the numberof neighbors used to determine the class and it stronglyinfluences the identification rate of the KNN model

232 SVM Classifier The support vector machine (SVM) isa supervised machine learning algorithm based on statisticallearning theory [21] SVM is primarily used to solve classi-fication and regression problems and has been successfullyapplied to bioinformatics tasks such as alternative splice sitesidentification and diagnostic method of diabetes [22 23]SVM uses a nonlinear mapping function to map originaldata into a high-dimensional feature space It then constructsa hyperplane to be the surface that discriminates betweenpositive and negative data




CA G T0

002

004

006

008

01


AA

AA

AG ACA

ACG

AGA

AGG

ATA

ATG

CAA

CAG

CCA

CCG

CGA

CGG

CTA

CTG

GA

AG

AGG

CAG

CGG

GA

GG

GG

TAG

TG TAA

TCA

TAG

TCG

TGA

TGG

TTA

TTG


0

005

01

015

02

025

03

Mea

n oc

curr

ence

0

0005

001

0015

002

0025

003

Mea

n oc

curr

ence











sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(3)



sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(4)



VI(119905) (119883119895) =sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

minussum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(5)


VI (119883119895) =sum119899119905119903119890119890119905=1 VI(119905) (119883119895)

119899119905119903119890119890 (6)






119878119899 =TP

TP + FN

119878119901 =TN

TN + FP


(7)


3 Results











KNNSVMRF

CFXGBoost

00

02

04

06

08

10

Sens

itivi

ty











SVM 119862 1 1 500 25119892 0000001 10minus6 1 0000012












gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)


[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)



4 Discussion



Additional Points




Acknowledgments




References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology




CA G T0

002

004

006

008

01


AA

AA

AG ACA

ACG

AGA

AGG

ATA

ATG

CAA

CAG

CCA

CCG

CGA

CGG

CTA

CTG

GA

AG

AGG

CAG

CGG

GA

GG

GG

TAG

TG TAA

TCA

TAG

TCG

TGA

TGG

TTA

TTG


0

005

01

015

02

025

03

Mea

n oc

curr

ence

0

0005

001

0015

002

0025

003

Mea

n oc

curr

ence











sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(3)



sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(4)



VI(119905) (119883119895) =sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

minussum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(5)


VI (119883119895) =sum119899119905119903119890119890119905=1 VI(119905) (119883119895)

119899119905119903119890119890 (6)






119878119899 =TP

TP + FN

119878119901 =TN

TN + FP


(7)


3 Results











KNNSVMRF

CFXGBoost

00

02

04

06

08

10

Sens

itivi

ty











SVM 119862 1 1 500 25119892 0000001 10minus6 1 0000012












gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)


[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)



4 Discussion



Additional Points




Acknowledgments




References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology





sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(3)



sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(4)



VI(119905) (119883119895) =sum119894 119861(119905)119868 (119910119894 = 119910(119905)119894 )100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

minussum119894 119861(119905)119868 (119910119894 = 119910(119905)119894120587119895|119885)

100381610038161003816100381610038161003816119861(119905)100381610038161003816100381610038161003816

(5)


VI (119883119895) =sum119899119905119903119890119890119905=1 VI(119905) (119883119895)

119899119905119903119890119890 (6)






119878119899 =TP

TP + FN

119878119901 =TN

TN + FP


(7)


3 Results











KNNSVMRF

CFXGBoost

00

02

04

06

08

10

Sens

itivi

ty











SVM 119862 1 1 500 25119892 0000001 10minus6 1 0000012












gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)


[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)



4 Discussion



Additional Points




Acknowledgments




References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology









KNNSVMRF

CFXGBoost

00

02

04

06

08

10

Sens

itivi

ty











SVM 119862 1 1 500 25119892 0000001 10minus6 1 0000012












gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)


[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)



4 Discussion



Additional Points




Acknowledgments




References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology






SVM 119862 1 1 500 25119892 0000001 10minus6 1 0000012












gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)


[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)



4 Discussion



Additional Points




Acknowledgments




References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


gacLength

aacagg

acacgtagga

cgaagacaa

gaa

cgt

5 strength

3 strength

0001 0002 0003 0004 0005 00060

Importance score

(a)

Lengthgaca

caaac

gaaaacaga

ccgac

tgga

cag

cca

2 4 6 8 10 120

Importance score

(b)


[16][15][14][13] Our method

SnSpTA

Different methods

60

65

70

75

80

85

90

95

100

Valu

e (

)



4 Discussion



Additional Points




Acknowledgments




References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology



References








































Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology








Volume 201






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology

Documents

Comparative Analysis and Classification of Cassette Exons