15
Research Article ADLD: A Novel Graphical Representation of Protein Sequences and Its Application Lei Wang, 1,2 Hui Peng, 1,2 and Jinhua Zheng 1,2 1 Key Laboratory of Intelligent Computing & Information Processing, Ministry of Education, Xiangtan University, Xiangtan 411105, China 2 College of Information Engineering, Xiangtan University, Xiangtan 411105, China Correspondence should be addressed to Lei Wang; [email protected] Received 14 August 2014; Accepted 25 September 2014; Published 30 October 2014 Academic Editor: Qi Dai Copyright © 2014 Lei Wang et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. To facilitate the intuitional analysis of protein sequences, a novel graphical representation of protein sequences called ADLD (Alignment Diagonal Line Diagram) is introduced in this paper first, and then a new ADLD based method is proposed and utilized to analyze the similarity/dissimilarity of protein sequences. Comparing with existing methods, our ADLD based method is proved to be effective in the similarity/dissimilarity analysis of protein sequences and have the merits of good intuition, visuality, and simplicity. e examinations of the similarities/dissimilarities for both the 16 different ND5 proteins and the 29 different spike proteins illustrate the utility of our ADLD based approach. 1. Introduction Homology analysis is one of the hot topics in the area of protein sequences analysis. Up to now, lots of methods have been proposed for the homology analysis of protein sequences [13], and among them a useful one is the graphical representation of protein sequences, which is proved to be a powerful tool for visual comparison of protein sequences. At first, graphical representation methods were intro- duced for representation of DNA sequences on the basis of multiple dimension space [47]. Aſter obtaining the sequence invariants from the graphics, one can compare the sequences based on comparison of sequence invariants. Graphical representation methods were proposed as an alternative approach of direct comparison of DNA sequences, which are computational intensive (even those of a restricted length) [8]. Protein sequences are to some degree similar to DNA sequences, which are composed of different units. us the graphical representation methods can be extended to describe protein sequences obviously. Currently, many researchers have proposed different methods for the graphical representation of protein sequences [924]. For example, Feng and Zhang [25] suggested Zp-curve based on the hydrophobicity and charged properties of amino acid residues along the primary sequence. Randi´ c et al. [26] introduced a graphical representation of protein sequences based on a graphical representation of triplets of DNA in which the interior of a square or a tetrahedron is utilized to accommodate 64 sites for the 64 codons. Bai and Wang [27] derived a 2D graphical representation of protein sequences based on nucleotide triplet codons. Yao et al. [28] outlined a 2D graphical representation of protein sequences based on two classifications of amino acids. Abo el Maaty et al. [29] proposed a novel unique 3D graphical representation of protein sequences based on three physicochemical properties of amino acid side chains. Abo-Elkhier introduced a 3D graphical representation of protein sequence based on a right cone of a unit base and unit height on protein sequences interfaces [30]. El-Lakkani and El-Sherif [31] proposed a graphical representation of protein sequence to help similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices. Ma et al. [32] introduced a family of Iterated Function Systems (IFS) to outline a 2D graphical representation of protein sequences. In most of these existing methods, the main draw- backs are that the higher the dimension of the protein sequence graphs, the heavier the computation complexity of Hindawi Publishing Corporation Computational and Mathematical Methods in Medicine Volume 2014, Article ID 959753, 15 pages http://dx.doi.org/10.1155/2014/959753

ADLD: A Novel Graphical Representation of Protein

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ADLD: A Novel Graphical Representation of Protein

Research ArticleADLD A Novel Graphical Representation of Protein Sequencesand Its Application

Lei Wang12 Hui Peng12 and Jinhua Zheng12

1 Key Laboratory of Intelligent Computing amp Information Processing Ministry of Education Xiangtan UniversityXiangtan 411105 China

2 College of Information Engineering Xiangtan University Xiangtan 411105 China

Correspondence should be addressed to Lei Wang phdleiwanggmailcom

Received 14 August 2014 Accepted 25 September 2014 Published 30 October 2014

Academic Editor Qi Dai

Copyright copy 2014 Lei Wang et alThis is an open access article distributed under theCreative CommonsAttribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

To facilitate the intuitional analysis of protein sequences a novel graphical representation of protein sequences called ADLD(Alignment Diagonal Line Diagram) is introduced in this paper first and then a new ADLD based method is proposed and utilizedto analyze the similaritydissimilarity of protein sequences Comparing with existing methods our ADLD based method is provedto be effective in the similaritydissimilarity analysis of protein sequences and have the merits of good intuition visuality andsimplicity The examinations of the similaritiesdissimilarities for both the 16 different ND5 proteins and the 29 different spikeproteins illustrate the utility of our ADLD based approach

1 Introduction

Homology analysis is one of the hot topics in the areaof protein sequences analysis Up to now lots of methodshave been proposed for the homology analysis of proteinsequences [1ndash3] and among themauseful one is the graphicalrepresentation of protein sequences which is proved to be apowerful tool for visual comparison of protein sequences

At first graphical representation methods were intro-duced for representation of DNA sequences on the basis ofmultiple dimension space [4ndash7] After obtaining the sequenceinvariants from the graphics one can compare the sequencesbased on comparison of sequence invariants Graphicalrepresentation methods were proposed as an alternativeapproach of direct comparison of DNA sequences which arecomputational intensive (even those of a restricted length)[8] Protein sequences are to some degree similar to DNAsequences which are composed of different units Thusthe graphical representation methods can be extended todescribe protein sequences obviously

Currently many researchers have proposed differentmethods for the graphical representation of proteinsequences [9ndash24] For example Feng and Zhang [25]suggested Zp-curve based on the hydrophobicity and

charged properties of amino acid residues along theprimary sequence Randic et al [26] introduced a graphicalrepresentation of protein sequences based on a graphicalrepresentation of triplets of DNA in which the interiorof a square or a tetrahedron is utilized to accommodate64 sites for the 64 codons Bai and Wang [27] derived a2D graphical representation of protein sequences basedon nucleotide triplet codons Yao et al [28] outlined a2D graphical representation of protein sequences basedon two classifications of amino acids Abo el Maaty et al[29] proposed a novel unique 3D graphical representation ofprotein sequences based on three physicochemical propertiesof amino acid side chains Abo-Elkhier introduced a 3Dgraphical representation of protein sequence based on a rightcone of a unit base and unit height on protein sequencesinterfaces [30] El-Lakkani and El-Sherif [31] proposeda graphical representation of protein sequence to helpsimilarity analysis of protein sequences based on 2D and 3Damino acid adjacency matrices Ma et al [32] introduced afamily of Iterated Function Systems (IFS) to outline a 2Dgraphical representation of protein sequences

In most of these existing methods the main draw-backs are that the higher the dimension of the proteinsequence graphs the heavier the computation complexity of

Hindawi Publishing CorporationComputational and Mathematical Methods in MedicineVolume 2014 Article ID 959753 15 pageshttpdxdoiorg1011552014959753

2 Computational and Mathematical Methods in Medicine

the methods or the lower the recognition degree of the pro-tein sequence graphs For example in the methods proposedin [26 28] the main drawback is that the lines will cross eachother which will decrease the visibility of the graphics In themethods proposed in [29ndash31] the main drawbacks are thatthe 3D graphics seem to be more complex and have lowervisibility than the 2D graphics and in addition to obtain thesequence invariants from the graphics complex matrixes arerequired to be constructed which need much computationand storage

Sequence alignment is a way of arranging the sequencesof DNA RNA or protein to identify regions of similaritythat may be a consequence of functional structural orevolutionary relationships between the sequences [33] Upto now there are many kinds of algorithms having beenimplemented for sequence alignment [34ndash37] These meth-ods are usually efficient but complex and time consumingComparing with the alignment methods existing graphicalrepresentation methods can also display the inner structureof the protein sequences and can be utilized to find the sim-ilaritydissimilarity more visible according to their graphicsIn this paper we proposed a novel method for analyzing thesimilaritydissimilarity by combining the idea of the sequencealignment and the graphical representation methods to somedegree avoid the weakness of both of these two methods

Principal components analysis (PCA) is a standard toolin multivariate data analysis to reduce the number of dimen-sions which has been proved to be effective in the processof protein sequence analysis [38ndash40] Therefore in order toovercome the main drawbacks of existing methods in thispaper a novel graphical representation of protein sequencescalled ADLD (Alignment Diagonal Line Diagram) is intro-duced based on PCA and then a newADLD basedmethod isproposed and utilized to analyze the similaritydissimilarityof protein sequences And in addition to validate theeffectiveness of our ADLD based method we adopt it toanalyze the similaritydissimilarity of both the 16 differentND5 proteins and the 29 different spike proteins respectivelywhich are widely used as the test data [16ndash26] The analysisresults show that our method is not only visual intuitionaland effective in the similaritydissimilarity analysis of proteinsequences but also quite simple since there are no highdimensional matrixes required to be constructed

2 Materials and Methods

21 Procedure of OurMethod for Analysis of Protein SequencesIn this section we will illustrate the overall procedures of ourmethod for analyzing protein sequences as follows at first

(1) Select the same 9 different properties for each aminoacid and construct a 20 times 9 matrix as the input dataof the PCA algorithm on the basis of total 20 differentamino acids

(2) According to the PCA algorithm we can obtain aunique feature for each amino acid

(3) For each protein sequence in the test data we willreplace each amino acid in the protein sequence

with its corresponding unique feature and then wecan transform the protein sequence into a numericalsequence

(4) For any two numerical sequences we can drawa graph named ADLD and then abstract somenumerical characteristics of it which can be utilizedto analyze the similaritydissimilarity of these twosequences

Next in Sections 22ndash26 we will introduce the detailsof constructing the ADLDs and obtaining some of thenumerical characteristics of them In Section 31 we will givethe method for constructing the similaritydissimilarity ofour test sequence groups

22 AminoAcids andTheir Properties Proteins are composedof 20 different amino acids and these amino acids have manydifferent physicochemical and biological properties such asthe molecular weight (mW) hydropathy index (hI) the pKavalue for terminal amino acid groups COOH (pK1) the pKavalue for terminal amino acid groupsNH

3

+ (pK2) isoelectricpoint (pI) solubility (119878) the number of triplet codons (cN)frequency of human proteins (119865) and van der Waals radiusof side chains (vR) The names and symbols of the 20 aminoacids and the value of their 9 major properties are illustratedin Table 1

23 Principal Components Analysis Principal componentsanalysis (PCA) is a common technique for dimensionalityreduction and pattern recognition in datasets of high dimen-sion [41] The main purposes of PCA are the analysis ofdata to identify patterns and finding patterns to reduce thedimensions of the dataset with minimal loss of informationThe general steps of conducting PCA are as follows

Step 1 For119898 samples 1198831 1198832 119883

119898 suppose that each119883

119894

has 119899 components 1199091198941 1199091198941 119909

119894119899 let119883

119894= (1199091198941 1199091198941 119909

119894119899)

for 119894 isin 1 2 119898 and then construct an 119898 times 119899 matrix Xaccording to the following formula first

X =

[[[[

[

1198831

1198832

119883119898

]]]]

]

=

[[[[

[

11990911

11990912

sdot sdot sdot 1199091119899

11990921

11990922

sdot sdot sdot 1199092119899

1199091198981

1199091198982

sdot sdot sdot 119909119898119899

]]]]

]

(1)

Next based on thematrixX construct the corresponding119898 times 119899 standardized matrix Xlowast according to the followingformula

Xlowast =

[[[[[[

[

119883lowast

1

119883lowast

2

119883lowast

119898

]]]]]]

]

=

[[[[[[

[

119909lowast

11119909lowast

12sdot sdot sdot 119909

lowast

1119899

119909lowast

21119909lowast

22sdot sdot sdot 119909

lowast

2119899

119909lowast

1198981119909lowast

1198982sdot sdot sdot 119909lowast

119898119899

]]]]]]

]

(2)

where 119883lowast

119894= (119909lowast

1198941 119909lowast

1198942 119909

lowast

119894119899) 119909lowast119894119895

= (119909119894119895minus 119909119895)radicvar(119909

119895) 119909119895=

(1119899)sum119899

119894=1119909119894119895 and var(119909

119895) = (1(119899 minus 1))sum

119899

119894=1(119909119894119895

minus 119909119895)2 for

119894 isin 1 2 119898 and 119895 isin 1 2 119899

Computational and Mathematical Methods in Medicine 3

Table 1 The full list of 20 amino acids and the value of their 9 different properties

Amino acid Symbol mW hI pK1 pK2 pI 119878 cN 119865 () vRAlanine A 89079 18 234 969 601 1672 4 78 67Cysteine C 121145 25 196 1028 507 0 2 19 86Aspartic acid D 133089 minus35 188 96 277 5 2 53 91Glutamic acid E 147116 minus35 219 967 322 85 2 63 109Phenylalanine F 165177 28 183 913 548 276 2 39 135Glycine G 75052 minus04 234 96 597 2499 4 72 48Histidine H 155141 minus32 182 917 759 0 2 23 118Isoleucine I 13116 45 236 968 602 345 3 53 124Lysine K 14617 minus39 218 895 974 739 2 59 135Leucine L 13116 38 236 96 598 217 6 91 124Methionine M 149199 19 228 921 574 562 1 23 124Asparagine N 132104 minus35 202 88 541 285 2 43 96Proline P 115117 16 199 1096 648 1620 4 52 90Glutamine Q 146131 minus35 217 913 565 72 2 42 114Arginine R 174188 minus45 217 904 1076 8556 6 51 148Serine S 105078 minus08 221 915 568 422 6 68 73Tyrosine T 119105 minus07 211 962 587 132 4 59 93Valine V 117133 42 232 962 597 581 4 66 105Tryptophan W 204213 minus09 238 939 589 136 1 14 163Threonine Y 181176 minus13 22 911 566 04 2 32 141

Step 2 Based on thematrixXlowast construct the 119899times119899 correlationmatrix R according to the following formula

R =

[[[[

[

1198771

1198772

119877119899

]]]]

]

=

[[[[

[

11990311

11990312

sdot sdot sdot 1199031119899

11990321

11990322

sdot sdot sdot 1199032119899

1199031198991

1199031198992

sdot sdot sdot 119903119899119899

]]]]

]

(3)

where we can find that 119903119894119895

= sum119898

119896=1(119909lowast

119896119894minus 119909lowast

119894)(119909lowast

119896119895minus 119909lowast

119895)

radicsum119898

119896=1(119909lowast

119896119894minus 119909lowast

119894)2sum119898

119896=1(119909lowast

119896119895minus 119909lowast

119895)2 for 119894 isin 1 2 119899 and

119895 isin 1 2 119899

Step 3 From the correlationmatrixR obtain its 119899 eigenvalues1205821ge 1205822ge sdot sdot sdot ge 120582

119899gt 0 and the corresponding 119899 eigenvectors

1198861=

[[[[

[

11988611

11988621

1198861198991

]]]]

]

1198862=

[[[[

[

11988612

11988622

1198861198992

]]]]

]

119886119899=

[[[[

[

1198861119899

1198862119899

119886119899119899

]]]]

]

(4)

respectively And from now on we can obtain 119899 principalcomponents 119865

119894for 119894 isin 1 2 119899 as follows

119865119894= 1198861119894119883lowast

1+ 1198862119894119883lowast

2+ sdot sdot sdot + 119886

119899119894119883lowast

119899 (5)

Step 4 For each principal component 119865119894for 119894 isin 1 2 119899

obtain its contribution rate CR119894and accumulated contribution

rate ACR119894according to the following formulas respectively

CR119894=

120582119894

sum119899

119896=1120582119896

(6)

ACR119894=

119894

sum

119896=1

CR119896 (7)

Generally in order to lower the computation complexitywe can keep only the first 119905 (119905 le 119899) principal components1198651 1198652 119865

119905 where the accumulated contribution rate of

the 119905th principal component 119865119905shall satisfy the fact that

ACR119905ge 85

Step 5 For 119895 isin 1 2 119905 let

119865119895=

[[[[[[[

[

119865119895

1

119865119895

2

119865119895

119898

]]]]]]]

]

(8)

Then for each 119894 isin 1 2 119898 we can obtain the total scoreof the 119894th sample as follows

TotalScore (119894) =

119905

sum

119896=1

119865119896

119894times CR119896 (9)

4 Computational and Mathematical Methods in Medicine

Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids

Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000

Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first

4 eigenvalues in Table 2

1198861

1198862

1198863

1198864

05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495

24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865

1 1198652 119865

9 And therefore as

illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865

1 1198652 119865

9 respec-

tivelyFrom Table 2 we can see that the accumulative con-

tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582

1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the

first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886

1 1198862 1198863 1198864

corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately

Based on Table 3 we can obtain the first 4 principalcomponents 119865

1 1198652 1198653 1198654 as follows

1198651= 05036119883

lowast

1minus 02454119883

lowast

2minus 01634119883

lowast

3

minus 03101119883lowast

4+ 00702119883

lowast

5minus 01665119883

lowast

6

minus 03872119883lowast

7minus 04377119883

lowast

8+ 04349119883

lowast

9

Table 4 The total scores of the 20 amino acids

Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050

1198652= 01436119883

lowast

1minus 01875119883

lowast

2+ 01820119883

lowast

3

minus 01883119883lowast

4+ 06464119883

lowast

5+ 04465119883

lowast

6

+ 03931119883lowast

7+ 01844119883

lowast

8+ 02643119883

lowast

9

1198653= 00571119883

lowast

1+ 02304119883

lowast

2+ 06298119883

lowast

3

minus 03964119883lowast

4minus 00786119883

lowast

5minus 05280119883

lowast

6

+ 01003119883lowast

7+ 02544119883

lowast

8+ 01738119883

lowast

9

1198654= 02158119883

lowast

1+ 06547119883

lowast

2+ 02288119883

lowast

3

+ 05071119883lowast

4+ 00532119883

lowast

5+ 01877119883

lowast

6

minus 00532119883lowast

7minus 02273119883

lowast

8+ 03495119883

lowast

9

(10)

Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties

such as pI 119878 and cN will have a major role in the secondprincipal component 119865

2 the third principal component 119865

3is

mainly determined by pK1 and 119878 and the fourth principalcomponent 119865

4is closely linked with hI and pK2 and so forth

Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)

Computational and Mathematical Methods in Medicine 5

25 Numerical Sequences of Protein Sequences LetΩ = AC

DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901

111990121199013 119901119873represents a protein sequence with

119873 amino acids where 119901119894

isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878

Ψ= (1199051 1199052 119905

119873)

corresponding to the protein sequence Ψ through replacingeach amino acid 119901

119894in Ψ with its corresponding value of

TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein

sequences

Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM

According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525

05735 minus04525 minus04525 minus01205

119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525

05735 minus04525 minus04525 minus01205

119878Opo = 05735 07868 minus02643 01435 minus00242 01435

minus07077 minus00242 minus04525 05735

(11)

26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904

1 1199042) suppose that the protein

sequence 1199041includes 119873

1amino acids 119904

2includes 119873

2amino

acids and 1198731

⩾ 1198732 then in order to measure the

similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps

Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904

1 1199042) into two numerical

sequences with the same length as follows

1198781= 1199051 1199052 119905

1198731

1198782= 1199051 1199052 119905

1198732

0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

1198731minus1198732

(12)

Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904

1 1199042) that is let 119904

1= 1199011 1199012 1199013 119901

1198731

1199042=

1199021 1199022 1199023 119902

1198732

then for any amino acid 119901119894in the protein

sequence 1199041 we will compare it with these 2119908 + 1 amino

acids 119902119894minus119908

119902119894minus1

119902119894 119902119894+1

119902119894+119908

in the protein sequence1199042 and then 119908 can be simply defined as follows

119908 =

120585 if 1198731minus 1198732le 120585

1198731minus 1198732 else

(13)

where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904

1 1199042) will not be too small to

expose the association of the inner structures of the proteinsequence pair (119904

1 1199042) In actual applications we suggest that

120585 shall be no less than 10

Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904

1 1199042) can

be briefly defined as follows

119860120576

119894119895= Θ (

10038161003816100381610038161003816119905119894minus 119905119895

10038161003816100381610038161003816minus 120576) (14)

where 119894 isin 1 2 1198731 119895 isin 1 2 119873

1 and Θ is a

Heaviside function which can be defined as follows

Θ (119909) =

1 if 119909 le 0

0 else(15)

Thereafter we can obtain an 1198731times 1198731alignment matrix

(AM) as follows

AM = (119860120576

119894119895)1198731times1198731

(16)

Step 4 For the1198731times1198731elements in the alignmentmatrix AM

we can plot points on 119894-119895 plane for these elements in the AMwith 119860

120576

119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call

the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904

1 1199042)

For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0

From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)

For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575

consecutive APs on the AT or BTs as a CAPS where 120575 ge 1

is a given thresholdFor a given CAPS caps

1 if there is no CAPS caps

2

satisfying caps1

sub caps2 then we call the caps

1a maximum

CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)

6 Computational and Mathematical Methods in Medicine

Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be

For convenience of analysis in an ADLD suppose thatthere are 119870

1different SFs and 119870

2different FPs on its AT

119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following

(1) For these 1198701different SFs and 119870

2different FPs

on the AT of the ADLD we will number these1198701SFs and 119870

2FPs from left to right and utilize

ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870

1SFs and 119870

2FPs separately And

in addition we would also call these SFs on the AT ofthe ADLD the ASFs

(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT

119870 to represent these BTs separately

and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT

minus1BTminus2

BTminus119870

to represent these BTsseparately

(3) For each BT119897 where 119897 isin 1 2 119870 suppose

that there are 1198703different SFs on the BT

119897 then we

will number these 1198703SFs from left to right and

utilize BSF1119897BSF2119897 BSF1198703

119897 to represent these SFs

separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs

According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs

From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)

to the point (32 32) and the other is BSF1minus4 that is the

line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively

Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs

Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs

From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs

3 Results and Discussion

31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences

Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ

119873 then while applying the ADLDs to analyze

the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps

Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878

119873

Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin

1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively

Step 3 Suppose that there are totally 1198711different ASFs

such as ASF1ASF2 ASF1198711 1198712different BSFs such

as BSF11198971

BSF21198972

BSF11987121198971198712

and 1198713different FPs such as

FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895

119897119895

let their length be length119894 and length119895

respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871

2

then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as

follows

SD (Ψ119886 Ψ119887) =

1198711

sum

119894=1

length119894 +1198712

sum

119895=1

length119895+ 1198713 (17)

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 2: ADLD: A Novel Graphical Representation of Protein

2 Computational and Mathematical Methods in Medicine

the methods or the lower the recognition degree of the pro-tein sequence graphs For example in the methods proposedin [26 28] the main drawback is that the lines will cross eachother which will decrease the visibility of the graphics In themethods proposed in [29ndash31] the main drawbacks are thatthe 3D graphics seem to be more complex and have lowervisibility than the 2D graphics and in addition to obtain thesequence invariants from the graphics complex matrixes arerequired to be constructed which need much computationand storage

Sequence alignment is a way of arranging the sequencesof DNA RNA or protein to identify regions of similaritythat may be a consequence of functional structural orevolutionary relationships between the sequences [33] Upto now there are many kinds of algorithms having beenimplemented for sequence alignment [34ndash37] These meth-ods are usually efficient but complex and time consumingComparing with the alignment methods existing graphicalrepresentation methods can also display the inner structureof the protein sequences and can be utilized to find the sim-ilaritydissimilarity more visible according to their graphicsIn this paper we proposed a novel method for analyzing thesimilaritydissimilarity by combining the idea of the sequencealignment and the graphical representation methods to somedegree avoid the weakness of both of these two methods

Principal components analysis (PCA) is a standard toolin multivariate data analysis to reduce the number of dimen-sions which has been proved to be effective in the processof protein sequence analysis [38ndash40] Therefore in order toovercome the main drawbacks of existing methods in thispaper a novel graphical representation of protein sequencescalled ADLD (Alignment Diagonal Line Diagram) is intro-duced based on PCA and then a newADLD basedmethod isproposed and utilized to analyze the similaritydissimilarityof protein sequences And in addition to validate theeffectiveness of our ADLD based method we adopt it toanalyze the similaritydissimilarity of both the 16 differentND5 proteins and the 29 different spike proteins respectivelywhich are widely used as the test data [16ndash26] The analysisresults show that our method is not only visual intuitionaland effective in the similaritydissimilarity analysis of proteinsequences but also quite simple since there are no highdimensional matrixes required to be constructed

2 Materials and Methods

21 Procedure of OurMethod for Analysis of Protein SequencesIn this section we will illustrate the overall procedures of ourmethod for analyzing protein sequences as follows at first

(1) Select the same 9 different properties for each aminoacid and construct a 20 times 9 matrix as the input dataof the PCA algorithm on the basis of total 20 differentamino acids

(2) According to the PCA algorithm we can obtain aunique feature for each amino acid

(3) For each protein sequence in the test data we willreplace each amino acid in the protein sequence

with its corresponding unique feature and then wecan transform the protein sequence into a numericalsequence

(4) For any two numerical sequences we can drawa graph named ADLD and then abstract somenumerical characteristics of it which can be utilizedto analyze the similaritydissimilarity of these twosequences

Next in Sections 22ndash26 we will introduce the detailsof constructing the ADLDs and obtaining some of thenumerical characteristics of them In Section 31 we will givethe method for constructing the similaritydissimilarity ofour test sequence groups

22 AminoAcids andTheir Properties Proteins are composedof 20 different amino acids and these amino acids have manydifferent physicochemical and biological properties such asthe molecular weight (mW) hydropathy index (hI) the pKavalue for terminal amino acid groups COOH (pK1) the pKavalue for terminal amino acid groupsNH

3

+ (pK2) isoelectricpoint (pI) solubility (119878) the number of triplet codons (cN)frequency of human proteins (119865) and van der Waals radiusof side chains (vR) The names and symbols of the 20 aminoacids and the value of their 9 major properties are illustratedin Table 1

23 Principal Components Analysis Principal componentsanalysis (PCA) is a common technique for dimensionalityreduction and pattern recognition in datasets of high dimen-sion [41] The main purposes of PCA are the analysis ofdata to identify patterns and finding patterns to reduce thedimensions of the dataset with minimal loss of informationThe general steps of conducting PCA are as follows

Step 1 For119898 samples 1198831 1198832 119883

119898 suppose that each119883

119894

has 119899 components 1199091198941 1199091198941 119909

119894119899 let119883

119894= (1199091198941 1199091198941 119909

119894119899)

for 119894 isin 1 2 119898 and then construct an 119898 times 119899 matrix Xaccording to the following formula first

X =

[[[[

[

1198831

1198832

119883119898

]]]]

]

=

[[[[

[

11990911

11990912

sdot sdot sdot 1199091119899

11990921

11990922

sdot sdot sdot 1199092119899

1199091198981

1199091198982

sdot sdot sdot 119909119898119899

]]]]

]

(1)

Next based on thematrixX construct the corresponding119898 times 119899 standardized matrix Xlowast according to the followingformula

Xlowast =

[[[[[[

[

119883lowast

1

119883lowast

2

119883lowast

119898

]]]]]]

]

=

[[[[[[

[

119909lowast

11119909lowast

12sdot sdot sdot 119909

lowast

1119899

119909lowast

21119909lowast

22sdot sdot sdot 119909

lowast

2119899

119909lowast

1198981119909lowast

1198982sdot sdot sdot 119909lowast

119898119899

]]]]]]

]

(2)

where 119883lowast

119894= (119909lowast

1198941 119909lowast

1198942 119909

lowast

119894119899) 119909lowast119894119895

= (119909119894119895minus 119909119895)radicvar(119909

119895) 119909119895=

(1119899)sum119899

119894=1119909119894119895 and var(119909

119895) = (1(119899 minus 1))sum

119899

119894=1(119909119894119895

minus 119909119895)2 for

119894 isin 1 2 119898 and 119895 isin 1 2 119899

Computational and Mathematical Methods in Medicine 3

Table 1 The full list of 20 amino acids and the value of their 9 different properties

Amino acid Symbol mW hI pK1 pK2 pI 119878 cN 119865 () vRAlanine A 89079 18 234 969 601 1672 4 78 67Cysteine C 121145 25 196 1028 507 0 2 19 86Aspartic acid D 133089 minus35 188 96 277 5 2 53 91Glutamic acid E 147116 minus35 219 967 322 85 2 63 109Phenylalanine F 165177 28 183 913 548 276 2 39 135Glycine G 75052 minus04 234 96 597 2499 4 72 48Histidine H 155141 minus32 182 917 759 0 2 23 118Isoleucine I 13116 45 236 968 602 345 3 53 124Lysine K 14617 minus39 218 895 974 739 2 59 135Leucine L 13116 38 236 96 598 217 6 91 124Methionine M 149199 19 228 921 574 562 1 23 124Asparagine N 132104 minus35 202 88 541 285 2 43 96Proline P 115117 16 199 1096 648 1620 4 52 90Glutamine Q 146131 minus35 217 913 565 72 2 42 114Arginine R 174188 minus45 217 904 1076 8556 6 51 148Serine S 105078 minus08 221 915 568 422 6 68 73Tyrosine T 119105 minus07 211 962 587 132 4 59 93Valine V 117133 42 232 962 597 581 4 66 105Tryptophan W 204213 minus09 238 939 589 136 1 14 163Threonine Y 181176 minus13 22 911 566 04 2 32 141

Step 2 Based on thematrixXlowast construct the 119899times119899 correlationmatrix R according to the following formula

R =

[[[[

[

1198771

1198772

119877119899

]]]]

]

=

[[[[

[

11990311

11990312

sdot sdot sdot 1199031119899

11990321

11990322

sdot sdot sdot 1199032119899

1199031198991

1199031198992

sdot sdot sdot 119903119899119899

]]]]

]

(3)

where we can find that 119903119894119895

= sum119898

119896=1(119909lowast

119896119894minus 119909lowast

119894)(119909lowast

119896119895minus 119909lowast

119895)

radicsum119898

119896=1(119909lowast

119896119894minus 119909lowast

119894)2sum119898

119896=1(119909lowast

119896119895minus 119909lowast

119895)2 for 119894 isin 1 2 119899 and

119895 isin 1 2 119899

Step 3 From the correlationmatrixR obtain its 119899 eigenvalues1205821ge 1205822ge sdot sdot sdot ge 120582

119899gt 0 and the corresponding 119899 eigenvectors

1198861=

[[[[

[

11988611

11988621

1198861198991

]]]]

]

1198862=

[[[[

[

11988612

11988622

1198861198992

]]]]

]

119886119899=

[[[[

[

1198861119899

1198862119899

119886119899119899

]]]]

]

(4)

respectively And from now on we can obtain 119899 principalcomponents 119865

119894for 119894 isin 1 2 119899 as follows

119865119894= 1198861119894119883lowast

1+ 1198862119894119883lowast

2+ sdot sdot sdot + 119886

119899119894119883lowast

119899 (5)

Step 4 For each principal component 119865119894for 119894 isin 1 2 119899

obtain its contribution rate CR119894and accumulated contribution

rate ACR119894according to the following formulas respectively

CR119894=

120582119894

sum119899

119896=1120582119896

(6)

ACR119894=

119894

sum

119896=1

CR119896 (7)

Generally in order to lower the computation complexitywe can keep only the first 119905 (119905 le 119899) principal components1198651 1198652 119865

119905 where the accumulated contribution rate of

the 119905th principal component 119865119905shall satisfy the fact that

ACR119905ge 85

Step 5 For 119895 isin 1 2 119905 let

119865119895=

[[[[[[[

[

119865119895

1

119865119895

2

119865119895

119898

]]]]]]]

]

(8)

Then for each 119894 isin 1 2 119898 we can obtain the total scoreof the 119894th sample as follows

TotalScore (119894) =

119905

sum

119896=1

119865119896

119894times CR119896 (9)

4 Computational and Mathematical Methods in Medicine

Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids

Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000

Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first

4 eigenvalues in Table 2

1198861

1198862

1198863

1198864

05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495

24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865

1 1198652 119865

9 And therefore as

illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865

1 1198652 119865

9 respec-

tivelyFrom Table 2 we can see that the accumulative con-

tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582

1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the

first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886

1 1198862 1198863 1198864

corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately

Based on Table 3 we can obtain the first 4 principalcomponents 119865

1 1198652 1198653 1198654 as follows

1198651= 05036119883

lowast

1minus 02454119883

lowast

2minus 01634119883

lowast

3

minus 03101119883lowast

4+ 00702119883

lowast

5minus 01665119883

lowast

6

minus 03872119883lowast

7minus 04377119883

lowast

8+ 04349119883

lowast

9

Table 4 The total scores of the 20 amino acids

Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050

1198652= 01436119883

lowast

1minus 01875119883

lowast

2+ 01820119883

lowast

3

minus 01883119883lowast

4+ 06464119883

lowast

5+ 04465119883

lowast

6

+ 03931119883lowast

7+ 01844119883

lowast

8+ 02643119883

lowast

9

1198653= 00571119883

lowast

1+ 02304119883

lowast

2+ 06298119883

lowast

3

minus 03964119883lowast

4minus 00786119883

lowast

5minus 05280119883

lowast

6

+ 01003119883lowast

7+ 02544119883

lowast

8+ 01738119883

lowast

9

1198654= 02158119883

lowast

1+ 06547119883

lowast

2+ 02288119883

lowast

3

+ 05071119883lowast

4+ 00532119883

lowast

5+ 01877119883

lowast

6

minus 00532119883lowast

7minus 02273119883

lowast

8+ 03495119883

lowast

9

(10)

Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties

such as pI 119878 and cN will have a major role in the secondprincipal component 119865

2 the third principal component 119865

3is

mainly determined by pK1 and 119878 and the fourth principalcomponent 119865

4is closely linked with hI and pK2 and so forth

Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)

Computational and Mathematical Methods in Medicine 5

25 Numerical Sequences of Protein Sequences LetΩ = AC

DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901

111990121199013 119901119873represents a protein sequence with

119873 amino acids where 119901119894

isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878

Ψ= (1199051 1199052 119905

119873)

corresponding to the protein sequence Ψ through replacingeach amino acid 119901

119894in Ψ with its corresponding value of

TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein

sequences

Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM

According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525

05735 minus04525 minus04525 minus01205

119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525

05735 minus04525 minus04525 minus01205

119878Opo = 05735 07868 minus02643 01435 minus00242 01435

minus07077 minus00242 minus04525 05735

(11)

26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904

1 1199042) suppose that the protein

sequence 1199041includes 119873

1amino acids 119904

2includes 119873

2amino

acids and 1198731

⩾ 1198732 then in order to measure the

similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps

Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904

1 1199042) into two numerical

sequences with the same length as follows

1198781= 1199051 1199052 119905

1198731

1198782= 1199051 1199052 119905

1198732

0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

1198731minus1198732

(12)

Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904

1 1199042) that is let 119904

1= 1199011 1199012 1199013 119901

1198731

1199042=

1199021 1199022 1199023 119902

1198732

then for any amino acid 119901119894in the protein

sequence 1199041 we will compare it with these 2119908 + 1 amino

acids 119902119894minus119908

119902119894minus1

119902119894 119902119894+1

119902119894+119908

in the protein sequence1199042 and then 119908 can be simply defined as follows

119908 =

120585 if 1198731minus 1198732le 120585

1198731minus 1198732 else

(13)

where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904

1 1199042) will not be too small to

expose the association of the inner structures of the proteinsequence pair (119904

1 1199042) In actual applications we suggest that

120585 shall be no less than 10

Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904

1 1199042) can

be briefly defined as follows

119860120576

119894119895= Θ (

10038161003816100381610038161003816119905119894minus 119905119895

10038161003816100381610038161003816minus 120576) (14)

where 119894 isin 1 2 1198731 119895 isin 1 2 119873

1 and Θ is a

Heaviside function which can be defined as follows

Θ (119909) =

1 if 119909 le 0

0 else(15)

Thereafter we can obtain an 1198731times 1198731alignment matrix

(AM) as follows

AM = (119860120576

119894119895)1198731times1198731

(16)

Step 4 For the1198731times1198731elements in the alignmentmatrix AM

we can plot points on 119894-119895 plane for these elements in the AMwith 119860

120576

119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call

the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904

1 1199042)

For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0

From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)

For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575

consecutive APs on the AT or BTs as a CAPS where 120575 ge 1

is a given thresholdFor a given CAPS caps

1 if there is no CAPS caps

2

satisfying caps1

sub caps2 then we call the caps

1a maximum

CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)

6 Computational and Mathematical Methods in Medicine

Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be

For convenience of analysis in an ADLD suppose thatthere are 119870

1different SFs and 119870

2different FPs on its AT

119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following

(1) For these 1198701different SFs and 119870

2different FPs

on the AT of the ADLD we will number these1198701SFs and 119870

2FPs from left to right and utilize

ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870

1SFs and 119870

2FPs separately And

in addition we would also call these SFs on the AT ofthe ADLD the ASFs

(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT

119870 to represent these BTs separately

and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT

minus1BTminus2

BTminus119870

to represent these BTsseparately

(3) For each BT119897 where 119897 isin 1 2 119870 suppose

that there are 1198703different SFs on the BT

119897 then we

will number these 1198703SFs from left to right and

utilize BSF1119897BSF2119897 BSF1198703

119897 to represent these SFs

separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs

According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs

From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)

to the point (32 32) and the other is BSF1minus4 that is the

line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively

Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs

Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs

From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs

3 Results and Discussion

31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences

Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ

119873 then while applying the ADLDs to analyze

the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps

Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878

119873

Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin

1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively

Step 3 Suppose that there are totally 1198711different ASFs

such as ASF1ASF2 ASF1198711 1198712different BSFs such

as BSF11198971

BSF21198972

BSF11987121198971198712

and 1198713different FPs such as

FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895

119897119895

let their length be length119894 and length119895

respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871

2

then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as

follows

SD (Ψ119886 Ψ119887) =

1198711

sum

119894=1

length119894 +1198712

sum

119895=1

length119895+ 1198713 (17)

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 3: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 3

Table 1 The full list of 20 amino acids and the value of their 9 different properties

Amino acid Symbol mW hI pK1 pK2 pI 119878 cN 119865 () vRAlanine A 89079 18 234 969 601 1672 4 78 67Cysteine C 121145 25 196 1028 507 0 2 19 86Aspartic acid D 133089 minus35 188 96 277 5 2 53 91Glutamic acid E 147116 minus35 219 967 322 85 2 63 109Phenylalanine F 165177 28 183 913 548 276 2 39 135Glycine G 75052 minus04 234 96 597 2499 4 72 48Histidine H 155141 minus32 182 917 759 0 2 23 118Isoleucine I 13116 45 236 968 602 345 3 53 124Lysine K 14617 minus39 218 895 974 739 2 59 135Leucine L 13116 38 236 96 598 217 6 91 124Methionine M 149199 19 228 921 574 562 1 23 124Asparagine N 132104 minus35 202 88 541 285 2 43 96Proline P 115117 16 199 1096 648 1620 4 52 90Glutamine Q 146131 minus35 217 913 565 72 2 42 114Arginine R 174188 minus45 217 904 1076 8556 6 51 148Serine S 105078 minus08 221 915 568 422 6 68 73Tyrosine T 119105 minus07 211 962 587 132 4 59 93Valine V 117133 42 232 962 597 581 4 66 105Tryptophan W 204213 minus09 238 939 589 136 1 14 163Threonine Y 181176 minus13 22 911 566 04 2 32 141

Step 2 Based on thematrixXlowast construct the 119899times119899 correlationmatrix R according to the following formula

R =

[[[[

[

1198771

1198772

119877119899

]]]]

]

=

[[[[

[

11990311

11990312

sdot sdot sdot 1199031119899

11990321

11990322

sdot sdot sdot 1199032119899

1199031198991

1199031198992

sdot sdot sdot 119903119899119899

]]]]

]

(3)

where we can find that 119903119894119895

= sum119898

119896=1(119909lowast

119896119894minus 119909lowast

119894)(119909lowast

119896119895minus 119909lowast

119895)

radicsum119898

119896=1(119909lowast

119896119894minus 119909lowast

119894)2sum119898

119896=1(119909lowast

119896119895minus 119909lowast

119895)2 for 119894 isin 1 2 119899 and

119895 isin 1 2 119899

Step 3 From the correlationmatrixR obtain its 119899 eigenvalues1205821ge 1205822ge sdot sdot sdot ge 120582

119899gt 0 and the corresponding 119899 eigenvectors

1198861=

[[[[

[

11988611

11988621

1198861198991

]]]]

]

1198862=

[[[[

[

11988612

11988622

1198861198992

]]]]

]

119886119899=

[[[[

[

1198861119899

1198862119899

119886119899119899

]]]]

]

(4)

respectively And from now on we can obtain 119899 principalcomponents 119865

119894for 119894 isin 1 2 119899 as follows

119865119894= 1198861119894119883lowast

1+ 1198862119894119883lowast

2+ sdot sdot sdot + 119886

119899119894119883lowast

119899 (5)

Step 4 For each principal component 119865119894for 119894 isin 1 2 119899

obtain its contribution rate CR119894and accumulated contribution

rate ACR119894according to the following formulas respectively

CR119894=

120582119894

sum119899

119896=1120582119896

(6)

ACR119894=

119894

sum

119896=1

CR119896 (7)

Generally in order to lower the computation complexitywe can keep only the first 119905 (119905 le 119899) principal components1198651 1198652 119865

119905 where the accumulated contribution rate of

the 119905th principal component 119865119905shall satisfy the fact that

ACR119905ge 85

Step 5 For 119895 isin 1 2 119905 let

119865119895=

[[[[[[[

[

119865119895

1

119865119895

2

119865119895

119898

]]]]]]]

]

(8)

Then for each 119894 isin 1 2 119898 we can obtain the total scoreof the 119894th sample as follows

TotalScore (119894) =

119905

sum

119896=1

119865119896

119894times CR119896 (9)

4 Computational and Mathematical Methods in Medicine

Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids

Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000

Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first

4 eigenvalues in Table 2

1198861

1198862

1198863

1198864

05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495

24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865

1 1198652 119865

9 And therefore as

illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865

1 1198652 119865

9 respec-

tivelyFrom Table 2 we can see that the accumulative con-

tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582

1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the

first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886

1 1198862 1198863 1198864

corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately

Based on Table 3 we can obtain the first 4 principalcomponents 119865

1 1198652 1198653 1198654 as follows

1198651= 05036119883

lowast

1minus 02454119883

lowast

2minus 01634119883

lowast

3

minus 03101119883lowast

4+ 00702119883

lowast

5minus 01665119883

lowast

6

minus 03872119883lowast

7minus 04377119883

lowast

8+ 04349119883

lowast

9

Table 4 The total scores of the 20 amino acids

Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050

1198652= 01436119883

lowast

1minus 01875119883

lowast

2+ 01820119883

lowast

3

minus 01883119883lowast

4+ 06464119883

lowast

5+ 04465119883

lowast

6

+ 03931119883lowast

7+ 01844119883

lowast

8+ 02643119883

lowast

9

1198653= 00571119883

lowast

1+ 02304119883

lowast

2+ 06298119883

lowast

3

minus 03964119883lowast

4minus 00786119883

lowast

5minus 05280119883

lowast

6

+ 01003119883lowast

7+ 02544119883

lowast

8+ 01738119883

lowast

9

1198654= 02158119883

lowast

1+ 06547119883

lowast

2+ 02288119883

lowast

3

+ 05071119883lowast

4+ 00532119883

lowast

5+ 01877119883

lowast

6

minus 00532119883lowast

7minus 02273119883

lowast

8+ 03495119883

lowast

9

(10)

Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties

such as pI 119878 and cN will have a major role in the secondprincipal component 119865

2 the third principal component 119865

3is

mainly determined by pK1 and 119878 and the fourth principalcomponent 119865

4is closely linked with hI and pK2 and so forth

Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)

Computational and Mathematical Methods in Medicine 5

25 Numerical Sequences of Protein Sequences LetΩ = AC

DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901

111990121199013 119901119873represents a protein sequence with

119873 amino acids where 119901119894

isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878

Ψ= (1199051 1199052 119905

119873)

corresponding to the protein sequence Ψ through replacingeach amino acid 119901

119894in Ψ with its corresponding value of

TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein

sequences

Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM

According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525

05735 minus04525 minus04525 minus01205

119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525

05735 minus04525 minus04525 minus01205

119878Opo = 05735 07868 minus02643 01435 minus00242 01435

minus07077 minus00242 minus04525 05735

(11)

26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904

1 1199042) suppose that the protein

sequence 1199041includes 119873

1amino acids 119904

2includes 119873

2amino

acids and 1198731

⩾ 1198732 then in order to measure the

similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps

Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904

1 1199042) into two numerical

sequences with the same length as follows

1198781= 1199051 1199052 119905

1198731

1198782= 1199051 1199052 119905

1198732

0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

1198731minus1198732

(12)

Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904

1 1199042) that is let 119904

1= 1199011 1199012 1199013 119901

1198731

1199042=

1199021 1199022 1199023 119902

1198732

then for any amino acid 119901119894in the protein

sequence 1199041 we will compare it with these 2119908 + 1 amino

acids 119902119894minus119908

119902119894minus1

119902119894 119902119894+1

119902119894+119908

in the protein sequence1199042 and then 119908 can be simply defined as follows

119908 =

120585 if 1198731minus 1198732le 120585

1198731minus 1198732 else

(13)

where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904

1 1199042) will not be too small to

expose the association of the inner structures of the proteinsequence pair (119904

1 1199042) In actual applications we suggest that

120585 shall be no less than 10

Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904

1 1199042) can

be briefly defined as follows

119860120576

119894119895= Θ (

10038161003816100381610038161003816119905119894minus 119905119895

10038161003816100381610038161003816minus 120576) (14)

where 119894 isin 1 2 1198731 119895 isin 1 2 119873

1 and Θ is a

Heaviside function which can be defined as follows

Θ (119909) =

1 if 119909 le 0

0 else(15)

Thereafter we can obtain an 1198731times 1198731alignment matrix

(AM) as follows

AM = (119860120576

119894119895)1198731times1198731

(16)

Step 4 For the1198731times1198731elements in the alignmentmatrix AM

we can plot points on 119894-119895 plane for these elements in the AMwith 119860

120576

119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call

the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904

1 1199042)

For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0

From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)

For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575

consecutive APs on the AT or BTs as a CAPS where 120575 ge 1

is a given thresholdFor a given CAPS caps

1 if there is no CAPS caps

2

satisfying caps1

sub caps2 then we call the caps

1a maximum

CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)

6 Computational and Mathematical Methods in Medicine

Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be

For convenience of analysis in an ADLD suppose thatthere are 119870

1different SFs and 119870

2different FPs on its AT

119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following

(1) For these 1198701different SFs and 119870

2different FPs

on the AT of the ADLD we will number these1198701SFs and 119870

2FPs from left to right and utilize

ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870

1SFs and 119870

2FPs separately And

in addition we would also call these SFs on the AT ofthe ADLD the ASFs

(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT

119870 to represent these BTs separately

and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT

minus1BTminus2

BTminus119870

to represent these BTsseparately

(3) For each BT119897 where 119897 isin 1 2 119870 suppose

that there are 1198703different SFs on the BT

119897 then we

will number these 1198703SFs from left to right and

utilize BSF1119897BSF2119897 BSF1198703

119897 to represent these SFs

separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs

According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs

From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)

to the point (32 32) and the other is BSF1minus4 that is the

line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively

Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs

Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs

From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs

3 Results and Discussion

31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences

Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ

119873 then while applying the ADLDs to analyze

the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps

Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878

119873

Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin

1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively

Step 3 Suppose that there are totally 1198711different ASFs

such as ASF1ASF2 ASF1198711 1198712different BSFs such

as BSF11198971

BSF21198972

BSF11987121198971198712

and 1198713different FPs such as

FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895

119897119895

let their length be length119894 and length119895

respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871

2

then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as

follows

SD (Ψ119886 Ψ119887) =

1198711

sum

119894=1

length119894 +1198712

sum

119895=1

length119895+ 1198713 (17)

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 4: ADLD: A Novel Graphical Representation of Protein

4 Computational and Mathematical Methods in Medicine

Table 2 The 9 eigenvalues (120582) of R and the contribution rates (CR)and the accumulative contribution rates (ACR) of the 9 principalcomponents obtained by conducting PCA of the 20 amino acids

Number 120582 CR ACR1 32237 03582 035822 19132 02126 057083 14048 01561 072694 11876 01320 085885 04959 00551 091396 04467 00496 096357 01992 00221 098578 01218 00135 099929 00071 00008 10000

Table 3The 4 eigenvectors 1198861 1198862 1198863 1198864 corresponding to the first

4 eigenvalues in Table 2

1198861

1198862

1198863

1198864

05036 01436 00571 02158minus02454 minus01875 02304 06547minus01634 01820 06298 02288minus03101 minus01883 minus03964 0507100702 06464 minus00786 00532minus01665 04465 minus05280 01877minus03872 03931 01003 minus00532minus04377 01844 02544 minus0227304349 02643 01738 03495

24 PCA of the Amino Acids Observing Table 1 if weconsider the 20 amino acids as 20 different samples and the9 properties of each amino acid as its 9 components thenaccording to the general steps of conducting PCA illustratedin Section 23 we can obtain a 20 times 9 matrix X and itsstandardized matrix Xlowast a 9 times 9 correlation matrix R and9 principal components 119865

1 1198652 119865

9 And therefore as

illustrated in Table 2 we can obtain the 9 eigenvalues of Rand the contribution rates and the accumulative contributionrates of the 9 principal components 119865

1 1198652 119865

9 respec-

tivelyFrom Table 2 we can see that the accumulative con-

tribution rate of the first 4 principal components amountsto 08588 (=8588) which is already bigger than 85Therefore we can keep the first 4 principal components onlyLet 120582

1 1205822 1205823 1205824 be the 4 eigenvalues corresponding to the

first 4 principal components respectively then as illustratedin Table 3 we can obtain the 4 eigenvectors 119886

1 1198862 1198863 1198864

corresponding to the 4 eigenvalues 1205821 1205822 1205823 1205824 separately

Based on Table 3 we can obtain the first 4 principalcomponents 119865

1 1198652 1198653 1198654 as follows

1198651= 05036119883

lowast

1minus 02454119883

lowast

2minus 01634119883

lowast

3

minus 03101119883lowast

4+ 00702119883

lowast

5minus 01665119883

lowast

6

minus 03872119883lowast

7minus 04377119883

lowast

8+ 04349119883

lowast

9

Table 4 The total scores of the 20 amino acids

Symbols of amino acids Total scoresA minus09324C minus05985D minus06709E minus02296F 04298G minus11780H 04476I 01435K 07868L minus01205M 05735N minus00242P minus09822Q 02848R 11169S minus07077T minus04525V minus02643W 14729Y 09050

1198652= 01436119883

lowast

1minus 01875119883

lowast

2+ 01820119883

lowast

3

minus 01883119883lowast

4+ 06464119883

lowast

5+ 04465119883

lowast

6

+ 03931119883lowast

7+ 01844119883

lowast

8+ 02643119883

lowast

9

1198653= 00571119883

lowast

1+ 02304119883

lowast

2+ 06298119883

lowast

3

minus 03964119883lowast

4minus 00786119883

lowast

5minus 05280119883

lowast

6

+ 01003119883lowast

7+ 02544119883

lowast

8+ 01738119883

lowast

9

1198654= 02158119883

lowast

1+ 06547119883

lowast

2+ 02288119883

lowast

3

+ 05071119883lowast

4+ 00532119883

lowast

5+ 01877119883

lowast

6

minus 00532119883lowast

7minus 02273119883

lowast

8+ 03495119883

lowast

9

(10)

Observing the above 4 formulas it is easy to find thatthere are three big coefficients in the first formula whichare 05036 (corresponding to mW) 04377 (corresponding to119865) and 04349 (corresponding to vR) respectivelyThereforeit means that the three properties such as mW 119865 andvR will have a major role in the first principal component1198651 Similarly we can also know that the three properties

such as pI 119878 and cN will have a major role in the secondprincipal component 119865

2 the third principal component 119865

3is

mainly determined by pK1 and 119878 and the fourth principalcomponent 119865

4is closely linked with hI and pK2 and so forth

Hence we can obtain the total scores of the 20 amino acids asillustrated in Table 4 according to formula (9)

Computational and Mathematical Methods in Medicine 5

25 Numerical Sequences of Protein Sequences LetΩ = AC

DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901

111990121199013 119901119873represents a protein sequence with

119873 amino acids where 119901119894

isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878

Ψ= (1199051 1199052 119905

119873)

corresponding to the protein sequence Ψ through replacingeach amino acid 119901

119894in Ψ with its corresponding value of

TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein

sequences

Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM

According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525

05735 minus04525 minus04525 minus01205

119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525

05735 minus04525 minus04525 minus01205

119878Opo = 05735 07868 minus02643 01435 minus00242 01435

minus07077 minus00242 minus04525 05735

(11)

26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904

1 1199042) suppose that the protein

sequence 1199041includes 119873

1amino acids 119904

2includes 119873

2amino

acids and 1198731

⩾ 1198732 then in order to measure the

similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps

Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904

1 1199042) into two numerical

sequences with the same length as follows

1198781= 1199051 1199052 119905

1198731

1198782= 1199051 1199052 119905

1198732

0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

1198731minus1198732

(12)

Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904

1 1199042) that is let 119904

1= 1199011 1199012 1199013 119901

1198731

1199042=

1199021 1199022 1199023 119902

1198732

then for any amino acid 119901119894in the protein

sequence 1199041 we will compare it with these 2119908 + 1 amino

acids 119902119894minus119908

119902119894minus1

119902119894 119902119894+1

119902119894+119908

in the protein sequence1199042 and then 119908 can be simply defined as follows

119908 =

120585 if 1198731minus 1198732le 120585

1198731minus 1198732 else

(13)

where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904

1 1199042) will not be too small to

expose the association of the inner structures of the proteinsequence pair (119904

1 1199042) In actual applications we suggest that

120585 shall be no less than 10

Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904

1 1199042) can

be briefly defined as follows

119860120576

119894119895= Θ (

10038161003816100381610038161003816119905119894minus 119905119895

10038161003816100381610038161003816minus 120576) (14)

where 119894 isin 1 2 1198731 119895 isin 1 2 119873

1 and Θ is a

Heaviside function which can be defined as follows

Θ (119909) =

1 if 119909 le 0

0 else(15)

Thereafter we can obtain an 1198731times 1198731alignment matrix

(AM) as follows

AM = (119860120576

119894119895)1198731times1198731

(16)

Step 4 For the1198731times1198731elements in the alignmentmatrix AM

we can plot points on 119894-119895 plane for these elements in the AMwith 119860

120576

119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call

the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904

1 1199042)

For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0

From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)

For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575

consecutive APs on the AT or BTs as a CAPS where 120575 ge 1

is a given thresholdFor a given CAPS caps

1 if there is no CAPS caps

2

satisfying caps1

sub caps2 then we call the caps

1a maximum

CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)

6 Computational and Mathematical Methods in Medicine

Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be

For convenience of analysis in an ADLD suppose thatthere are 119870

1different SFs and 119870

2different FPs on its AT

119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following

(1) For these 1198701different SFs and 119870

2different FPs

on the AT of the ADLD we will number these1198701SFs and 119870

2FPs from left to right and utilize

ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870

1SFs and 119870

2FPs separately And

in addition we would also call these SFs on the AT ofthe ADLD the ASFs

(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT

119870 to represent these BTs separately

and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT

minus1BTminus2

BTminus119870

to represent these BTsseparately

(3) For each BT119897 where 119897 isin 1 2 119870 suppose

that there are 1198703different SFs on the BT

119897 then we

will number these 1198703SFs from left to right and

utilize BSF1119897BSF2119897 BSF1198703

119897 to represent these SFs

separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs

According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs

From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)

to the point (32 32) and the other is BSF1minus4 that is the

line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively

Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs

Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs

From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs

3 Results and Discussion

31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences

Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ

119873 then while applying the ADLDs to analyze

the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps

Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878

119873

Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin

1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively

Step 3 Suppose that there are totally 1198711different ASFs

such as ASF1ASF2 ASF1198711 1198712different BSFs such

as BSF11198971

BSF21198972

BSF11987121198971198712

and 1198713different FPs such as

FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895

119897119895

let their length be length119894 and length119895

respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871

2

then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as

follows

SD (Ψ119886 Ψ119887) =

1198711

sum

119894=1

length119894 +1198712

sum

119895=1

length119895+ 1198713 (17)

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 5: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 5

25 Numerical Sequences of Protein Sequences LetΩ = AC

DE FGH IK LMNPQR STVWY and supposethat Ψ = 119901

111990121199013 119901119873represents a protein sequence with

119873 amino acids where 119901119894

isin Ω for 119894 isin 1 2 119873 thenwe can obtain a numerical sequence 119878

Ψ= (1199051 1199052 119905

119873)

corresponding to the protein sequence Ψ through replacingeach amino acid 119901

119894in Ψ with its corresponding value of

TotalScore(119894) for 119894 isin 1 2 119873For example consider the following 3 abbreviated protein

sequences

Hu = MTMHTTMTTLGor = MTMYATMTTLOpo = MKVINISNTM

According to the above descriptions and Table 4 thenwe can obtain their corresponding numerical sequences asfollows119878Hu = 05735 minus04525 05735 04476 minus04525 minus04525

05735 minus04525 minus04525 minus01205

119878Gor = 05735 minus04525 05735 09050 minus09324 minus04525

05735 minus04525 minus04525 minus01205

119878Opo = 05735 07868 minus02643 01435 minus00242 01435

minus07077 minus00242 minus04525 05735

(11)

26 ASDs and ADLDs of Protein Sequence Pairs For agiven protein sequence pair (119904

1 1199042) suppose that the protein

sequence 1199041includes 119873

1amino acids 119904

2includes 119873

2amino

acids and 1198731

⩾ 1198732 then in order to measure the

similaritydissimilarity between them in this section we willpresent a new method called Alignment Scatter Diagram(ASD) to plot the two sequences into a scatter diagram firstAnd for convenience we call the points in the ASD thealignment-plots (APs) The ASD of the protein sequence pair(1199041 1199042) can be obtained through the following steps

Step 1 According to the method given in Section 25 trans-late the protein sequence pair (119904

1 1199042) into two numerical

sequences with the same length as follows

1198781= 1199051 1199052 119905

1198731

1198782= 1199051 1199052 119905

1198732

0 0 0⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

1198731minus1198732

(12)

Step 2 Let 119908 be the alignment width (AW) of the proteinsequence pair (119904

1 1199042) that is let 119904

1= 1199011 1199012 1199013 119901

1198731

1199042=

1199021 1199022 1199023 119902

1198732

then for any amino acid 119901119894in the protein

sequence 1199041 we will compare it with these 2119908 + 1 amino

acids 119902119894minus119908

119902119894minus1

119902119894 119902119894+1

119902119894+119908

in the protein sequence1199042 and then 119908 can be simply defined as follows

119908 =

120585 if 1198731minus 1198732le 120585

1198731minus 1198732 else

(13)

where 120585 gt 0 is a given threshold to guarantee that the AWof the protein sequence pair (119904

1 1199042) will not be too small to

expose the association of the inner structures of the proteinsequence pair (119904

1 1199042) In actual applications we suggest that

120585 shall be no less than 10

Step 3 Let 120576 gt 0 be the dissimilarity degree (DD) of twoamino acids that is if 120576 = 0 then it means that the two aminoacids are the same otherwise it means that the two aminoacids are different from each other to some degree and thenthe APs in the ASD of the protein sequence pair (119904

1 1199042) can

be briefly defined as follows

119860120576

119894119895= Θ (

10038161003816100381610038161003816119905119894minus 119905119895

10038161003816100381610038161003816minus 120576) (14)

where 119894 isin 1 2 1198731 119895 isin 1 2 119873

1 and Θ is a

Heaviside function which can be defined as follows

Θ (119909) =

1 if 119909 le 0

0 else(15)

Thereafter we can obtain an 1198731times 1198731alignment matrix

(AM) as follows

AM = (119860120576

119894119895)1198731times1198731

(16)

Step 4 For the1198731times1198731elements in the alignmentmatrix AM

we can plot points on 119894-119895 plane for these elements in the AMwith 119860

120576

119894119895= 1 and |119894 minus 119895| le 119908 And for convenience we call

the obtained graph the Alignment Scatter Diagram (ASD) ofthe protein sequence pair (119904

1 1199042)

For example considering the three 120573-globin pro-tein sequences of chimpanzee [GenBank AAA163341]human [GenBank CAA262041] and gorilla [GenBankCAA434211] obtained from the GenBank respectively weillustrate the ASDs of the 120573-globin protein sequence pair(chimpanzee human) and the 120573-globin protein sequencepair (human gorilla) in Figures 1(a) and 1(b) separately whileletting 120576 = 0

From Figure 1 it is easy to see that there are lots of disor-dered points in these ASDs which will lower the visuality ofthe ASDs remarkably and obstruct us fromdistinguishing thesimilaritydissimilarity between the protein sequence pairsintuitively while observing these ASDsTherefore in order toimprove the intuition of theASD wewill propose a simplifiedvariant diagram of the ASD which is called the AlignmentDiagonal Line Diagram (ADLD)

For convenience in an ASD we call its main diagonalline the artery tracks (ATs) and the lines parallelling to itsmain diagonal line the by-path tracks (BTs) respectively Andin addition we define a set consisting with no less than 120575

consecutive APs on the AT or BTs as a CAPS where 120575 ge 1

is a given thresholdFor a given CAPS caps

1 if there is no CAPS caps

2

satisfying caps1

sub caps2 then we call the caps

1a maximum

CAPS And for convenience we call the line formed byconnecting all of the APs in a maximum CAPS a similarfragment (SF) and simultaneously we call all of the APs onthe AT but not on any SFs the free points (FPs)

6 Computational and Mathematical Methods in Medicine

Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be

For convenience of analysis in an ADLD suppose thatthere are 119870

1different SFs and 119870

2different FPs on its AT

119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following

(1) For these 1198701different SFs and 119870

2different FPs

on the AT of the ADLD we will number these1198701SFs and 119870

2FPs from left to right and utilize

ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870

1SFs and 119870

2FPs separately And

in addition we would also call these SFs on the AT ofthe ADLD the ASFs

(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT

119870 to represent these BTs separately

and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT

minus1BTminus2

BTminus119870

to represent these BTsseparately

(3) For each BT119897 where 119897 isin 1 2 119870 suppose

that there are 1198703different SFs on the BT

119897 then we

will number these 1198703SFs from left to right and

utilize BSF1119897BSF2119897 BSF1198703

119897 to represent these SFs

separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs

According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs

From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)

to the point (32 32) and the other is BSF1minus4 that is the

line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively

Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs

Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs

From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs

3 Results and Discussion

31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences

Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ

119873 then while applying the ADLDs to analyze

the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps

Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878

119873

Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin

1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively

Step 3 Suppose that there are totally 1198711different ASFs

such as ASF1ASF2 ASF1198711 1198712different BSFs such

as BSF11198971

BSF21198972

BSF11987121198971198712

and 1198713different FPs such as

FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895

119897119895

let their length be length119894 and length119895

respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871

2

then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as

follows

SD (Ψ119886 Ψ119887) =

1198711

sum

119894=1

length119894 +1198712

sum

119895=1

length119895+ 1198713 (17)

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 6: ADLD: A Novel Graphical Representation of Protein

6 Computational and Mathematical Methods in Medicine

Obviously in an ASD if keeping all of the SFs and FPsonly and omitting all those other APs then we will obtain asimplified variant diagram of the ASD and for conveniencewe call it the Alignment Diagonal Line Diagram (ADLD)Apparently if 120575 = 1 then an ADLD will degenerate into anASD Therefore in actual applications we suggest that 120575 willbe no less than 2 And particularly in order to find moreaccurate SFs in the ADLD of a protein sequence pair thelonger the protein sequences in the protein sequence pair arethe bigger the value of 120575 shall be

For convenience of analysis in an ADLD suppose thatthere are 119870

1different SFs and 119870

2different FPs on its AT

119870 different BTs locating above its AT and 119870 different BTslocating below its AT then we get the following

(1) For these 1198701different SFs and 119870

2different FPs

on the AT of the ADLD we will number these1198701SFs and 119870

2FPs from left to right and utilize

ASF1ASF2 ASF1198701 and FP1 FP2 FP1198702 torepresent these 119870

1SFs and 119870

2FPs separately And

in addition we would also call these SFs on the AT ofthe ADLD the ASFs

(2) For these 119870 different BTs locating above the AT wewill number these BTs from down to up and utilizeBT1BT2 BT

119870 to represent these BTs separately

and for these 119870 different BTs locating below theAT we will number these BTs from up to down andutilize BT

minus1BTminus2

BTminus119870

to represent these BTsseparately

(3) For each BT119897 where 119897 isin 1 2 119870 suppose

that there are 1198703different SFs on the BT

119897 then we

will number these 1198703SFs from left to right and

utilize BSF1119897BSF2119897 BSF1198703

119897 to represent these SFs

separately And in addition we would also call theseSFs on the BTs of the ADLD the BSFs

According to the above assumptions in Figure 2 we showthe two ADLDs corresponding to the ASDs illustrated inFigures 1(a) and 1(b) while letting 120575 = 3 And in additionto make the ADLDs more visual and intuitional in Figure 2we use the red ldquolowastrdquo to represent the FPs on the AT and the bluelines to represent the SFs on the AT or BTs

From Figure 2(a) it is easy to see that there are two SFsin the ADLD of the sequence pair (chimpanzee human)one is ASF1 that is the line segment from the point (1 1)

to the point (32 32) and the other is BSF1minus4 that is the

line segment from the point (35 31) to the point (125 121)And in addition there are totally 6 FPs in the ADLD whichare FP1(46 46) FP2(66 66) FP3(111 111) FP4(114 114)FP5(115 115) and FP6(123 123) respectively

Observing Figure 2(b) we can easily find that there arealso two SFs in the ADLD of the sequence pair (humangorilla) But different from that in Figure 2(a) the two SFsin Figure 2(b) are both ASFs one is ASF1 that is the linesegment from the point (1 1) to the point (104 104) andthe other is ASF2 that is the line segment from the point(106 106) to the point (121 121) And in addition the twoASFs in Figure 2(b) are separated by one gap and there existno FPs or BSFs on the AT or BTs

Through analysis we can know that for a given proteinsequence pair if there exist some deletions or insertions ofamino acid segments between the two protein sequencesthen there will exist some misalignments of SFs in theirADLD that is some ASFs on the AT will be transformedinto BSFs on some BTs And in addition if there exist somesubstitutions of the amino acids between the two proteinsequences then in their ADLD there will exist some gapsbetween two neighboring SFs or FPs on the AT Furthermoreif there exist some insertions deletions or substitutions of theamino acid segments at the end of the two protein sequencesthen in their ADLD there will exist no SFs or FPs on the ATor BTs

From the above descriptions it is easy to know thatthe ADLD of any given protein sequence pair obtainedby our above proposed method reflects some inner andspecific differences between these two protein sequences inthe given protein sequence pair which may be useful in thesimilaritydissimilarity analysis of protein sequence pairs

3 Results and Discussion

31 Method for SimilarityDissimilarity Analysis of ProteinSequences Based on the ADLDs According to the aboveanalysis we have known that the ADLDs may be useful inanalyzing the differences of the inner structures of proteinsequence pairs In this section we will show how to utilizethe ADLDs to analyze the similaritydissimilarity of a groupof protein sequences

Generally suppose that there are 119873 protein sequencesΨ1 Ψ2 Ψ

119873 then while applying the ADLDs to analyze

the similaritydissimilarity of these119873 sequences the similar-itydissimilarity matrix of these119873 sequences can be obtainedthrough the following steps

Step 1 According to the method given in Section 25 trans-form these 119873 protein sequences into119873 numerical sequences1198781 1198782 119878

119873

Step 2 For a given protein sequence pair Ψ119886 Ψ119887 119886 isin

1 2 119873 119887 isin 1 2 119873 we can obtain their ADLDthrough adopting the method proposed in Section 26 andthen we can obtain all of the SFs (including ASFs and BSFs)and FPs in the ADLD Hence we can obtain the lengths oftheseASFs the lengths of these BSFs and the number of theseFPs respectively

Step 3 Suppose that there are totally 1198711different ASFs

such as ASF1ASF2 ASF1198711 1198712different BSFs such

as BSF11198971

BSF21198972

BSF11987121198971198712

and 1198713different FPs such as

FP1 FP2 FP1198713 in the ADLD And in addition for eachASF119894 and BSF119895

119897119895

let their length be length119894 and length119895

respectively where 119894 isin 1 2 1198711 and 119895 isin 1 2 119871

2

then we can define the similarity degree (SD) of Ψ119886 Ψ119887 as

follows

SD (Ψ119886 Ψ119887) =

1198711

sum

119894=1

length119894 +1198712

sum

119895=1

length119895+ 1198713 (17)

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 7: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 7

150

100

50

0150100500

(a)

150

100

50

0150100500

(b)

Figure 1 (a)TheASD of the 120573-globin protein sequence pair (chimpanzee human) with 120585 = 12 (b) the ASD of the 120573-globin protein sequencepair (human gorilla) with 120585 = 16

150

100

50

0150100500

(a)

150

100

50

0

150100500

(b)

Figure 2 (a) The ADLD of the protein sequence pair (chimpanzee human) (b) the ADLD of the protein sequence pair (human gorilla)

And therefore according to these 119873 protein sequencesΨ1 Ψ2 Ψ

119873 we can obtain an 119873 times 119873 matching matrix

(MM) as follows

MM = [119889119894119895]119873times119873

(18)

where

119889119894119895

=

SD (Ψ119894 Ψ119895) if 119894 ge 119895

0 else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(19)

Step 4 Based on the matching matrixMM and all of its com-ponents 119889

119894119895 where 119894 isin 1 2 119873 and 119895 isin 1 2 119873 then

we can obtain an 119873 times 119873 similaritydissimilarity matrix (SM)of these 119873 protein sequences Ψ

1 Ψ2 Ψ

119873 as follows

SM = [119904119894119895]119873times119873

(20)

where

119904119894119895

=

1 minus Λ(

119889119894119895

119889119894119894

) if 119894 ge 119895

0 else

Λ(

119889119894119895

119889119894119894

) =

1 if119889119894119895

119889119894119894

ge 1

119889119894119895

119889119894119894

else

for 119894 isin 1 2 119873 119895 isin 1 2 119873

(21)

According to the above steps we present an examplethrough implementing the ADLDs to analyze the similar-itydissimilarity of 16 ND5 proteins (illustrated in Table 5)while letting 120575 = 3 and illustrate the results of similar-itydissimilarity matrix in Table 6

Observing Table 6 it is easy to find that there are somesimilar pairs such as (c-chim pi-chim) with the distance00510 (human c-chim) with the distance 00814 (human

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 8: ADLD: A Novel Graphical Representation of Protein

8 Computational and Mathematical Methods in Medicine

Table 5 The basic information of 16 ND5 protein sequences

Number Name Abbreviation Access number Length1 Human Human ADT804301 6032 Gorilla Gorilla NP 008222 6033 Pigmy chimpanzee Pi-chim NP 008209 6034 Common chimpanzee C-chim NP 008196 6035 Fin-whale Fin-whale NP 006899 6066 Blue-whale Blue-whale NP 007066 6067 Rat Rat AP 0049021 6108 Mouse Mouse NP 904338 6079 Opossum Opossum NP 007105 60210 Sheep Sheep ABW229031 60611 Goat Goat BAN592581 60612 Lemur Lemur CAD134311 60313 Cattle Cattle ADN119021 60614 Hare Hare CAD132911 60315 Gallus Gallus BAE160361 60516 Rabbit Rabbit NP 0075591 603

Sheep

Goat

Cattle

Fin-whale

Blue-whale

Lemur

Hare

Rabbit

Gorilla

Human

Pi-chim

C-chim

Rat

Mouse

Opossum

Gallus

00383

00468

00255

00255

00162

00162

00942

00942

02169

00356

00356

01084

00540

00419

02311

00419

00855

00128

00184

00085

00665

01080

00477

00770

00243

00176

00288

00163

00458

00142

000005010015020

Figure 3 The phylogenetic tree of the 16 species based on the ADLDs based method

pi-chim) with the distance 00720 (gorilla c-chim) with thedistance 00865 (gorilla pi-chim) with the distance 00833and (fin-whale blue-whale) with the distance 00324 Andamong them the opossum seems to be a peculiar mammalsince the shortest distance between it and the remaining

mammals is more than 04023 Obviously the result isconsistent with the fact that opossum is the most remotespecies from the remaining mammals

Additionally gallus seems to be more peculiar thanopossum since the shortest distance between it and

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 9: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 9

Table6Th

esim

ilaritydissim

ilaritymatrix

forthe

16ND5proteins

basedon

theA

DLD

sbased

metho

d

Hum

anGorilla

Pi-chim

C-chim

Fin-whale

Blue-w

hale

rat

mou

seop

ossum

sheep

goat

lemur

cattle

hare

gallu

srabbit

Hum

an000

00Gorilla

01111

00000

Pi-chim

007

20008

33000

00C-

chim

008

14008

65005

10000

00Fin-whale

03396

03285

03222

03301

00000

Blue-w

hale

03474

03333

03285

03301

00324

000

00Ra

t03693

03622

03636

03716

03333

03381

000

00Mou

se03740

03686

03716

03748

03317

03333

01883

000

00Opo

ssum

044

7604551

04290

044

1804515

04519

04513

044

79000

00Sheep

03020

02933

02871

02951

02023

02067

03149

03219

04121

00000

Goat

03036

02901

02871

02919

01958

02147

03166

03468

04202

007

12000

00Lemur

02989

02708

02839

02967

02557

02724

03166

03670

040

5502087

02317

000

00Ca

ttle

03114

03045

03046

03062

01958

02051

03149

03173

04235

009

0601254

02184

000

00Hare

03146

03157

03062

03046

02832

02788

03166

03421

04023

02217

02508

02053

02532

000

00Gallus

04726

04920

04737

05008

044

50044

2304903

04743

04691

04239

04524

046

8004183

046

6000000

Rabb

it03255

03189

03222

03142

02896

02756

03084

03390

04332

02184

02603

02282

02612

008

37044

34000

00

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 10: ADLD: A Novel Graphical Representation of Protein

10 Computational and Mathematical Methods in Medicine

the remaining animals is more than 04423 which is biggerthan 04023 (the shortest distance between Opossum andthe remaining mammals) Obviously the result is consistentwith the fact that gallus is not a kind of mammal

Therefore it is apparent that the results illustrated inTable 6 are wholly consistent with the results of the knownfact of evolution That is to say our ADLDs based methodcan be utilized as an effective way to analyze the similari-tiesdissimilarities of protein sequences

32 The Phylogenetic Tree of the Protein Sequences Based onthe ADLDs A phylogenetic tree is a diagram that is used torepresent the evolutionary relationships of organisms that arethought to have a common ancestry and it is a commonlyused tool for researchers in some fields to help them analyzethe clustering of different species

Obviously only through observing the similaritydissimilaritymatrix illustrated inTable 6 wewill find that it isnot very convenient to distinguish the similaritydissimilarityof protein sequences Therefore in order to show thesimilaritydissimilarity of the protein sequences more vividlyand intuitively according to the similaritydissimilaritymatrix illustrated in Table 6 then we will construct thephylogenetic tree of the above 16 ND5 proteins throughadopting the software MEGA 606 that is provided byTamura et al [41] and the result is illustrated in Figure 3

From Figure 3 it is obvious that we can not only findout the evolutionary relationships of these 16 ND5 proteinsequences visually and intuitively but also know easily thatthe constructed phylogenetic tree is consistent with theresults of the known fact of evolution to some degree

To further validate the performance of our ADLDs basedmethod we applied our method to analyze the similar-itydissimilarity of another group of proteins including 29spike proteins of coronavirus and compared ourmethodwiththe method proposed by Wen and Zhang [17] based on theabove given 16 ND5 proteins and the following 29 spikeproteins respectively The basic information of the 29 spikeproteins is illustrated in Table 7

For the 29 spike proteins illustrated in Table 7 weconstruct the phylogenetic tree in Figure 4 Since the spikeprotein sequences are very long (with more than 1100 aminoacids) therefore during simulation we set 120575 = 5 to avoid theeffect of noise points

Generally coronavirus can always be classified into fourclasses such as the Group I the Group II the Group IIIand the SARS-CoVs (Severe Acute Respiratory SyndromeCoronaviruses) And among these four classes the GroupI includes the Canine coronavirus (CCoV) the Feline coron-avirus (FCoV) the Human coronavirus 229E (HCoV-229E)the Porcine epidemic diarrhea virus (PEDV) and the Trans-missible gastroenteritis virus (TGEV) The Group II includesthe Bovine coronavirus (BCoV) Human coronavirus OC43(HCoV-OC43) the Murine coronavirus Mouse hepatitisvirus (MHV) the Porcine hemagglutinating encephalomyeli-tis virus (HEV) and the Rat coronavirus (RtCoV)TheGroupIII contains theAvian infectious bronchitis virus (IBV) and theTurkey coronavirus (TCoV)

Table 7 The basic information of 29 spike proteins

Number Access number Abbreviation Length1 CAB91145 TGEVG 14472 NP 058424 TGEV 14473 AAK38656 PEDVC 13834 NP 598310 PEDV 13835 NP 937950 HCoVOC43 13616 AAK83356 BCoVE 13637 AAL57308 BCoVL 13638 AAA66399 BCoVM 13639 AAL40400 BCoVQ 136310 AAB86819 MHVA 132411 YP 209233 MHVJHM 137612 AAF69334 MHVP 132113 AAF69344 MHVM 132414 AAP92675 IBVBJ 116915 AAS00080 IBVC 116916 NP 040831 IBV 116217 AAS10463 GD03T0013 125518 AAU93318 PC4127 125519 AAV49720 PC4137 125520 AAU93319 PC4205 125521 AAU04646 civet007 125522 AAU04649 civet010 125523 AAV91631 A022 125524 AAP51227 GD01 125525 AAS00003 GZ02 125526 AAP30030 BJ01 125527 AAP50485 FRA 125528 AAP41037 TOR2 125529 AAQ01597 TaiwanTC1 1255

From observing Figure 4 it is easy to know that the 29spike proteins of coronavirus can be perfectly classified intothe above four classes by our ADLDs based method

Finally for the convenience of comparison we illustratethe phylogenetic trees of the above given 29 spike proteins ofcoronavirus and 16ND5proteins constructed by adopting themethod proposed byWen and Zhang [17] in Figures 5 and 6respectively

Comparing Figure 3 with Figure 6 and Figure 4 withFigure 5 respectively it is obvious that the phylogenetic treesobtained by the method proposed by Wen and Zhang arequite unreasonable and not consistent with the known factsof evolution at all But on the contrary the phylogenetictrees obtained by our ADLDs based method are not onlyquite reasonable but also consistent with the known facts ofevolution to some degree Therefore there is no doubt thatthe performance of our method is much better than that ofthe method proposed by Wen and Zhang

33 The Analysis of Intuition and Visuality of the ADLDsIn Section 26 we have stated that the ADLDs of proteinsequence pairs are intuitional and visual In this section we

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 11: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 11

BCoVE

BCoVL

BCoVM

BCoVQ

HCoVOC43

MHVJHM

MHVP

MHVA

MHVM

TGEVG

TGEV

PEDVC

PEDV

TOR2

TaiwanTC1

FRA

BJ01

GD01

GZ02

civet007

A022

civet010

PC4137

GD03T0013

PC4127

PC4205

IBV

IBVBJ

IBVC

00000

00000

00000

00000

00369

00026

00026

00022

00022

00019

00559

00221

00019

00950

00950

01221

00010

00004

00015

00004

00004

00006

00004

00012

00012

00011

00006

00004

00004

04350

04350

00002

00002

00006

00005

00020

00005

00019

00018

00011

00202

00116

00112

00044

00040

04560

00232

00337

03498

03309

0027203461

00713

00230

00049

00052

Group II

Group I

Group III

SARS CoVs

Figure 4 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the ADLDs based method with 120575 = 5

will further discuss the intuition and visuality of the ADLDsin detail

From Table 6 we can obtain some similar pairs suchas (fin-whale blue-whale) (pi-chim c-chim) (Human c-chim) (cheep goat) (human pi-chim) and (hare rabbit)and some dissimilar pairs such as (human opossum) and(human gallus) among the above given 16 ND5 proteinsFrom these similardissimilar pairs wewill choose three pairsincluding (human gorilla) (human opossum) and (humangallus) as examples to further show the intuition and visualityof the ADLDs of these three protein sequence pairs TheADLDs of these three similardissimilar pairs are illustratedin Figure 7 while letting 120575 = 3

Observing Figure 7 we can clearly find that the totallength of all of the SFs in each of these three ADLDs satisfiesthe total length of all of the SFs in the ADLD of Figure 7(a) gtthe total length of all of the SFs in the ADLD of Figure 7(b) gtthe total length of all of the SFs in the ADLD of Figure 7(c)Therefore we can intuitively identify that the similarity of theproteins in each of these three protein sequence pairs satisfiesthe similarity of the proteins in the pair (human gorilla) gt thesimilarity of the proteins in the pair (human opossum) gt thesimilarity of the proteins in the pair (human gallus)

Moreover from Figure 7 we can also intuitively identifythat the two protein sequences in the protein sequence pair(human gorilla) are very similar to each other since the total

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 12: ADLD: A Novel Graphical Representation of Protein

12 Computational and Mathematical Methods in Medicine

PEDVC

PEDV

GD03T0013

BJ01

GD01

MHVA

PC4205

GZ02

TOR2

MHVM

PC4137

FRA

PC4127

TaiwanTC1

MHVP

A022

civet007

civet010

TGEVG

TGEV

BCoVM

BCoVQ

HCoVOC43

IBVC

MHVJHM

BCoVE

BCoVL

IBVBJ

IBV

00000

00000

00000

00000

00013

00006

00006

00001

00001

00000

00012

00002

00001

00010

00013

00010

00000

00000

00000

00000

00000

00000

00001

00001

00000

00000

00000

00001

0000000000

00000

00000

00006

00000

00000

00001

00000

00000

00001

00002

00001

00000

00002

00004

00002

00003

00003

00008

00006

00008

00067

00022

00013

00013

00008

00042

Figure 5 The phylogenetic tree of the 29 spike proteins of coronavirus constructed by adopting the method proposed by Wen and Zhang

length of all of the SFs in the ADLD of Figure 7(a) looksvery long But on the contrary we can intuitively identifythat the two protein sequences in either the protein sequencepair (human opossum) or the protein sequence pair (humangallus) are apparently dissimilar to each other since both thetotal length of all of the SFs in the ADLD of Figure 7(b) andthat in the ADLD of Figure 7(c) look very short

And through statistic we can know that the actual totallengths of all of the SFs in the ADLDs of these three proteinsequence pairs (human gorilla) (human opossum) and(human gallus) are 556 288 and 248 respectively

Additionally observing Figures 2(a) and 2(b) hardly canwe distinguish the total length of all of the SFs (includingASFs and BSFs) in the ADLD of Figure 2(a) and that inthe ADLD of Figure 2(b) since the total lengths of all ofthe SFs in these two ADLDs look nearly the same Andthrough statistic we can know that the actual total lengthsof all of the SFs in the ADLDs of Figures 2(a) and 2(b)are 123 and 120 respectively and are really close to eachother But through comparing Figure 2(a) with Figure 2(b)more carefully we can further discover that different fromFigure 2(b) except for the SFs there are also 6 different FPs

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 13: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 13

Fin-whale

Rabbit

Blue-whale

Sheep

Goat

Hare

Rat

Lemur

Mouse

Cattle

Opossum

Gorilla

Human

Pi-chim

C-chim

Gallus

00006

00037

00004

00004

00001

00002

00000

00001

00074

00002

00015

00000

00001

00011

00183

00001

00006

00006

00004

00005

00002

00005

00031

00008

00024

00020

00039

00060

00024

00086

Figure 6 The phylogenetic tree of the 16 ND5 proteins constructed by adopting the method proposed by Wen and Zhang

700

600

500

400

300

200

100

07006005004003002001000

(a)

700

600

500

400

300

200

100

07006005004003002001000

(b)

700

600

500

400

300

200

100

07006005004003002001000

(c)

Figure 7 (a) The ADLD of the similar pair (human gorilla) (b) the ADLD of the dissimilar pair (human opossum) (c) the ADLD of thedissimilar pair (human gallus)

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 14: ADLD: A Novel Graphical Representation of Protein

14 Computational and Mathematical Methods in Medicine

in the ADLD of Figure 2(a) while there are no FPs in theADLD of Figure 2(b) therefore we can intuitively identifythat the two protein sequences in the protein sequence pair(chimpanzee human) are more similar to the two proteinsequences in the protein sequence pair (human gorilla)

Hence from the above descriptions we can know that theADLDs obtained by our newly proposed method are quitevisual and intuitional andmay be a powerful and effective toolfor visual comparison of protein sequences and numericalsequences in other research fields

4 Conclusions

In this paper a novel ADLDs based graphical representationof protein sequences is proposed which is utilized to analyzethe similaritydissimilarity of protein sequences To validatethe performances of the newmethod we select two groups ofwell-known protein sequences as examples and additionallyin order to observe the similaritydissimilarity of proteinsequences more intuitively we construct the phylogenetictrees of protein sequences The results show that our ADLDsbased method not only has good performances and effects inthe similaritydissimilarity analysis of protein sequences butalso does not require complex computation since there areno high dimensional matrixes required Therefore it meansthat our ADLDs based method can work well in the analysisof protein sequences

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors thank the anonymous referees for suggestionsthat helped in improving the paper substantially And theproject is partly sponsored by the Colleges and UniversitiesOpen Innovation Platform Fund of Hunan Province (no13K041) the Hunan Provincial Natural Science Foundationof China (no 14JJ2070) the Construct Program of the KeyDiscipline in Hunan province the State Education MinistryScientific Research Foundation for the Returned OverseasChinese Scholars and the Introduced Talent Start-up FundProject of Xiangtan University (no 11QDZ45)

References

[1] K Bucka-Lassen O Caprani and J Hein ldquoCombining manymultiple alignments in one improved alignmentrdquo Bioinformat-ics vol 15 no 2 pp 122ndash130 1999

[2] L Wang and T Jiang ldquoOn the complexity of multiple sequencealignmentrdquo Journal of Computational Biology vol 1 no 4 pp337ndash348 1994

[3] C Shyu L Sheneman and J A Foster ldquoMultiple sequencealignment with evolutionary computationrdquo Genetic Program-ming and Evolvable Machines vol 5 no 2 pp 121ndash144 2004

[4] M Randic ldquoGraphical representations of DNA as 2-D maprdquoChemical Physics Letters vol 386 no 4ndash6 pp 468ndash471 2004

[5] D Bielinska-Waz T Clark P Waz W Nowak and A Nandyldquo2D-dynamic representation of DNA sequencesrdquo ChemicalPhysics Letters vol 442 no 1ndash3 pp 140ndash144 2007

[6] Z Qi and X Qi ldquoNovel 2D graphical representation of DNAsequence based on dual nucleotidesrdquo Chemical Physics Lettersvol 440 no 1ndash3 pp 139ndash144 2007

[7] W Chen B Liao Y Liu and Z Su ldquoA numerical representationof DNA sequences and its applicationsrdquoMATCH Communica-tions in Mathematical and in Computer Chemistry vol 60 no2 pp 291ndash300 2008

[8] M Randic X Guo and S C Basak ldquoOn the characterization ofDNAprimary sequences by triplet of nucleic acid basesrdquo Journalof Chemical Information and Computer Sciences vol 41 no 3pp 619ndash626 2001

[9] M Randic M Vracko M Novic and D Plavsic ldquoSpectralrepresentation of reduced protein modelsrdquo SAR and QSAR inEnvironmental Research vol 20 no 5-6 pp 415ndash427 2009

[10] M Randic K Mehulic D Vukicevic T Pisanski D Vikic-Topic and D Plavsic ldquoGraphical representation of proteins asfour-color maps and the irnumerical characterizationrdquo Journalof Molecular Graphics and Modelling vol 27 no 5 pp 637ndash6412009

[11] F Bai and TWang ldquoOn graphical and numerical representationof protein sequencesrdquo Journal of Biomolecular Structure andDynamics vol 23 no 5 pp 537ndash545 2006

[12] M Randic ldquo2-D Graphical representation of proteins based onphysico-chemical properties of amino acidsrdquo Chemical PhysicsLetters vol 440 no 4ndash6 pp 291ndash295 2007

[13] A Ghosh and A Nandy ldquoGraphical representation and mathe-matical characterization of protein sequences and applicationsto viral proteinsrdquo Advances in Protein Chemistry and StructuralBiology vol 83 pp 1ndash42 2011

[14] C Li X Yu L Yang X Zheng and Z Wang ldquo3-D maps andcoupling numbers for protein sequencesrdquo Physica A StatisticalMechanics and its Applications vol 388 no 9 pp 1967ndash19722009

[15] M Randic J Zupan and D Vikic-Topic ldquoOn representation ofproteins by star-like graphsrdquo Journal of Molecular Graphics andModelling vol 26 no 1 pp 290ndash305 2007

[16] C Li L Xing and X Wang ldquo2-D graphical representation ofprotein sequences and its application to coronavirus phylogenyrdquoJournal of Biochemistry and Molecular Biology vol 41 no 3 pp217ndash222 2008

[17] J Wen and Y Zhang ldquoA 2D graphical representation of proteinsequence and its numerical characterizationrdquo Chemical PhysicsLetters vol 476 no 4ndash6 pp 281ndash286 2009

[18] Z-C Wu X Xiao and K-C Chou ldquo2D-MH a web-server forgenerating graphic representation of protein sequences basedon the physicochemical properties of their constituent aminoacidsrdquo Journal of Theoretical Biology vol 267 no 1 pp 29ndash342010

[19] B Liao X Sun and Q Zeng ldquoA novel method for similarityanalysis and protein sub-cellular localization predictionrdquo Bioin-formatics vol 26 no 21 pp 2678ndash2683 2010

[20] M Novic and M Randic ldquoRepresentation of proteins as walksin 20-D spacerdquo SAR and QSAR in Environmental Research vol19 no 3-4 pp 317ndash337 2008

[21] Z-H Qi J Feng X-Q Qi and L Li ldquoApplication of 2Dgraphic representation of protein sequence based on Huffmantree methodrdquo Computers in Biology and Medicine vol 42 no 5pp 556ndash563 2012

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013

Page 15: ADLD: A Novel Graphical Representation of Protein

Computational and Mathematical Methods in Medicine 15

[22] H-J Yu and D-S Huang ldquoNovel 20-D descriptors of proteinsequences and itrsquos applications in similarity analysisrdquo ChemicalPhysics Letters vol 531 pp 261ndash266 2012

[23] P-A He J Wei Y Yao and Z Tie ldquoA novel graphical repre-sentation of proteins and its applicationrdquo Physica A StatisticalMechanics and its Applications vol 391 no 1-2 pp 93ndash99 2012

[24] M Randic M Novic andM Vracko ldquoOn novel representationof proteins based on amino acid adjacency matrixrdquo SAR andQSAR in Environmental Research vol 19 no 3-4 pp 339ndash3492008

[25] Z-P Feng andC-T Zhang ldquoA graphic representation of proteinsequence andpredicting the subcellular locations of prokaryoticproteinsrdquo International Journal of Biochemistry and Cell Biologyvol 34 no 3 pp 298ndash307 2002

[26] M Randic J Zupan and A T Balaban ldquoUnique graphicalrepresentation of protein sequences based on nucleotide tripletcodonsrdquo Chemical Physics Letters vol 397 no 1ndash3 pp 247ndash2522004

[27] F Bai and T Wang ldquoA 2-D graphical representation of proteinsequences based on nucleotide triplet codonsrdquoChemical PhysicsLetters vol 413 no 4ndash6 pp 458ndash462 2005

[28] Y-h Yao F Kong Q Dai and P-a He ldquoA sequence-segmentedmethod applied to the similarity analysis of long proteinsequencerdquo MATCH Communications in Mathematical and inComputer Chemistry vol 70 no 1 pp 431ndash450 2013

[29] M I Abo el Maaty M M Abo-Elkhier and M A AbdElwahaab ldquo3D graphical representation of protein sequencesand their statistical characterizationrdquo Physica A StatisticalMechanics and Its Applications vol 389 no 21 pp 4668ndash46762010

[30] M M Abo-Elkhier ldquoSimilaritydissimilarity analysis of proteinsequences using the spatial median as a descriptorrdquo Journal ofBiophysical Chemistry vol 3 pp 142ndash148 2012

[31] A El-Lakkani and S El-Sherif ldquoSimilarity analysis of proteinsequences based on 2D and 3D amino acid adjacencymatricesrdquoChemical Physics Letters vol 590 pp 192ndash195 2013

[32] T Ma Y Liu Q Dai Y Yao and P-A He ldquoA graphicalrepresentation of protein based on a novel iterated functionsystemrdquo Physica A Statistical Mechanics and its Applicationsvol 403 pp 21ndash28 2014

[33] D M Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2nd edition 2004

[34] D G Higgins and P M Sharp ldquoCLUSTAL a package forperforming multiple sequence alignment on a microcomputerrdquoGene vol 73 no 1 pp 237ndash244 1988

[35] R C Edgar ldquoMUSCLE multiple sequence alignment with highaccuracy and high throughputrdquo Nucleic Acids Research vol 32no 5 pp 1792ndash1797 2004

[36] K Katoh K Misawa K-I Kuma and T Miyata ldquoMAFFT anovel method for rapid multiple sequence alignment based onfast Fourier transformrdquo Nucleic Acids Research vol 30 no 14pp 3059ndash3066 2002

[37] C Notredame D G Higgins and J Heringa ldquoT-coffee a novelmethod for fast and accurate multiple sequence alignmentrdquoJournal of Molecular Biology vol 302 no 1 pp 205ndash217 2000

[38] MA BalseraWWriggers YOono andK Schulten ldquoPrincipalcomponent analysis and long time protein dynamicsrdquo Journal ofPhysical Chemistry vol 100 no 7 pp 2567ndash2572 1996

[39] B Hess ldquoSimilarities between principal components of proteindynamics and random diffusionrdquo Physical Review EmdashStatistical

Physics Plasmas Fluids and Related Interdisciplinary Topicsvol 62 no 6B pp 8438ndash8448 2000

[40] A L Tournier and J C Smith ldquoPrincipal components of theprotein dynamical transitionrdquo Physical Review Letters vol 91no 20 Article ID 208106 2003

[41] K Tamura G Stecher D Peterson A Filipski and S KumarldquoMEGA6molecular evolutionary genetics analysis version 60rdquoMolecular Biology and Evolution vol 30 no 12 pp 2725ndash27292013