14
Research Article Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures Anuj Sharma and Elias S. Manolakos Department of Informatics and Telecommunications, University of Athens, Athens, Greece Correspondence should be addressed to Elias S. Manolakos; [email protected] Received 24 July 2015; Revised 4 October 2015; Accepted 5 October 2015 Academic Editor: Yudong Cai Copyright © 2015 A. Sharma and E. S. Manolakos. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a soſtware framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel’s experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel’s Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high -measure of 0.91 on the benchmark CK34 dataset. e soſtware implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. 1. Introduction Proteins are polypeptide chains that take complex shapes in three-dimensional space. It is known that there is a strong correlation between the structure of a protein and its function [1]. Moreover, beyond evolutionary relationships encoded in the sequence, proteins’ structure presents evidence of homology even in sequentially divergent proteins. Compar- ison of the structure of a given (query) protein with that of many other proteins in a large database is a common task in Structural Bioinformatics [2]. Its objective is to retrieve proteins, with those of similar structure to the query being ranked higher in the results list. Protein Structure Comparison (PSC) is critical in homol- ogy detection, drug design [3], and structure modeling [4]. In [5–7] the authors list several PSC methods varying in terms of the algorithms and similarity metrics used, yielding different, but biologically relevant results. ere is currently no agreement on a single method that is superior for protein structures comparison [8–13]. Hence, the modern trend in the field is to perform Multicriteria Protein Structure Com- parison (MCPSC), that is, to integrate several PSC methods into one application and provide consensus results [14]. e approach banks on the idea that an ensemble of classifiers can yield better performance than any one of the constituent classifiers alone [15–17]. Pressing demand for computing power in the domain of protein structures comparison is the result of three factors: ever increasing size of structure databases [18], high com- putational complexity of PSC operations [19], and the trend towards applying multiple criteria PSC at a larger scale. So far this demand has been met using distributed computing platforms, such as clusters of workstations (COWs) and computer grids [8, 14, 20]. While distributed computing is popularly used in PSC, the parallel processing capabilities of modern and emerging processor architectures, such as Graphics Processing Units (GPUs), multi- and many-core Central Processing Units (CPUs) have not been tapped. Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 563674, 13 pages http://dx.doi.org/10.1155/2015/563674

Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

Research ArticleEfficient Multicriteria Protein Structure Comparison onModern Processor Architectures

Anuj Sharma and Elias S Manolakos

Department of Informatics and Telecommunications University of Athens Athens Greece

Correspondence should be addressed to Elias S Manolakos eliasmdiuoagr

Received 24 July 2015 Revised 4 October 2015 Accepted 5 October 2015

Academic Editor Yudong Cai

Copyright copy 2015 A Sharma and E S Manolakos This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factorsrapidly expanding structural proteomics databases high computational complexity of pairwise protein comparison algorithms andthe trend in the domain towards usingmultiple criteria for protein structures comparison (MCPSC) and combining resultsWe havedeveloped a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modernprocessors based on three popular PSC methods namely TMalign CE and USM We evaluate and compare the performance andefficiency of the two parallel MCPSC implementations using Intelrsquos experimental many-core Single-Chip Cloud Computer (SCC)as well as Intelrsquos Core i7 multicore processor We show that the 48-core SCC is more efficient than the latest generation Core i7achieving a speedup factor of 42 (efficiency of 09) making many-core processors an exciting emerging technology for large-scalestructural proteomics We compare and contrast the performance of the two processors on several datasets and also show thatMCPSC outperforms its component methods in grouping related domains achieving a high 119865-measure of 091 on the benchmarkCK34 dataset The software implementation for protein structure comparison using the three methods and combined MCPSCalong with the developed underlying rckskel algorithmic skeletons library is available via GitHub

1 Introduction

Proteins are polypeptide chains that take complex shapes inthree-dimensional space It is known that there is a strongcorrelation between the structure of a protein and its function[1] Moreover beyond evolutionary relationships encodedin the sequence proteinsrsquo structure presents evidence ofhomology even in sequentially divergent proteins Compar-ison of the structure of a given (query) protein with that ofmany other proteins in a large database is a common taskin Structural Bioinformatics [2] Its objective is to retrieveproteins with those of similar structure to the query beingranked higher in the results list

Protein Structure Comparison (PSC) is critical in homol-ogy detection drug design [3] and structure modeling [4]In [5ndash7] the authors list several PSC methods varying interms of the algorithms and similarity metrics used yieldingdifferent but biologically relevant results There is currentlyno agreement on a single method that is superior for protein

structures comparison [8ndash13] Hence the modern trend inthe field is to perform Multicriteria Protein Structure Com-parison (MCPSC) that is to integrate several PSC methodsinto one application and provide consensus results [14] Theapproach banks on the idea that an ensemble of classifierscan yield better performance than any one of the constituentclassifiers alone [15ndash17]

Pressing demand for computing power in the domain ofprotein structures comparison is the result of three factorsever increasing size of structure databases [18] high com-putational complexity of PSC operations [19] and the trendtowards applying multiple criteria PSC at a larger scale Sofar this demand has been met using distributed computingplatforms such as clusters of workstations (COWs) andcomputer grids [8 14 20] While distributed computing ispopularly used in PSC the parallel processing capabilitiesof modern and emerging processor architectures such asGraphics Processing Units (GPUs) multi- and many-coreCentral Processing Units (CPUs) have not been tapped

Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 563674 13 pageshttpdxdoiorg1011552015563674

2 BioMed Research International

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

System IF

Core 1

Core 0

Router MPB

Tile

Mem

ory

cont

rolle

r

Mem

ory

cont

rolle

r

Mem

ory

cont

rolle

r

Mem

ory

cont

rolle

rL2$0

L2$1

Figure 1 The Intel Single-Chip Cloud Computer It is a network on chip processor architecture having a mesh of 4 by 6 = 24 routers (R)connecting the 24 ldquotilesrdquo with 2 cores per tile [28]

These parallel processing architectures have become morereadily available [21 22] and instances of their use arebeginning to appear in the broader field of biocomputing[23 24] These processor architectures can in principle beused additively to meet the ever increasing computationaldemands of MCPSC by complementing already in use dis-tributed computing approaches [25]

Multicore and many-core CPUs as opposed to GPUsretain backward compatibility to well-established program-ming models [26] and offer the key advantage of usingprogramming methods languages and tools familiar tomost programmers Many-core processors differ from theirmulticore counterparts primarily on the communicationsubsystem Many-core CPUs use a Network-on-Chip (NoC)while multicore CPUs use bus-based structures [27] TheSingle-Chip Cloud Computer (SCC) experimental processor[28] is a 48-core NoC-based concept-vehicle created by IntelLabs in 2009 as a platform for many-core software researchWhile multicore CPUs are ubiquitous and easily availablefor parallel processing due to the drive from leading chipmanufacturers over the past several years [29] many-coreCPUs are not as deeply entrenched or commonly availableyet However due to architectural improvements many-coreprocessors have the potential to deliver scalable performanceby exploiting a larger number of cores at a lower costof intercore communication [30 31] Moreover many-coreprocessors offer improved power efficiency (performance perwatt) due to the use of more ldquolight-weightrdquo cores [26 32] Allthese factors makemodern CPUs (multicore andmany-core)powerful technologies suitable for meeting the high com-putational demands of MCPSC computations by employingscalable parallel processing Both multicore and many-coreprocessors have been used in bioinformatics [33] mainly forpairwise or multiple sequence comparison [34 35] HoweverNoC architectures have not yet been extensively employeddespite the flexibility and parallel processing capabilities theyoffer [35]

In this work we extend the framework presented in [36]for MCPSC We augment the software with implementa-tions of two popularly used PSC methodsmdashCombinatorialExtension (CE) [37] and Universal Similarity Metric (USM)[38]mdashin addition to TMalign [39] and then combine them

to create a scalable MCPSC application for the SCC We alsodevelop an equivalent MCPSC software implementation fora modern multicore CPU We combine sequential processing(for USMPSC tasks) and distributed processing (for TMalignand CE PSC tasks) to obtain high speedup on a singlechip Our solution highlights the flexibility offered by thesearchitectures in combining the parallel processing elementsbased on the requirements of the problem We analyze thecharacteristics of the MCPSC software on the SCC with aview of achieving high speedup (efficiency) and comparethe performance to that of the multicore implementationContrasting the performance of multicore and many-coreprocessors using multiple datasets of different sizes high-lights the advantages and disadvantages of each architectureas applicable to MCPSC To the best of our knowledgethis is the first attempt in the literature to develop andcompare MCPSC implementations for multicore and many-core CPUs

2 Methods

21 Experimental Systems Used The Single-Chip CloudComputer (SCC) is an experimental Intel microprocessorarchitecture containing 48 cores integrated on a silicon CPUchip It has multiple dual times86 core ldquotilesrdquo arranged in a 4 times6 mesh memory controllers and a 24-router mesh networkas depicted in Figure 1 Although it is a ldquoconcept-vehiclerdquo[40] the technology it uses is scalable to more than 100cores on a single chip [28] The SCC resembles a cluster ofcomputer nodes capable of communicating with each otherin much the same way as a cluster of independent machinesSalient features of the SCC hardware architecture relevant toprogramming the chip are listed in Table 1

Further each core of the SCC has L1 and L2 caches Inthe SCC architecture the L1 caches (16 KB each) are on thecore while the L2 caches (256KB each) are on the tile nextto the core with each tile carrying 2 cores Further each tilealso has a small message passing buffer (MPB) of 16 KBand is shared among all the cores on the chip Hence with24 tiles the SCC provides a Message Passing Buffer Type(MPBT) memory of size 384KB The SCC therefore providesa hierarchy of memories usable by application programs for

BioMed Research International 3

Data119876 query protein structures119863 database of known protein structures Sequential processingcalc usm dist(119876119863) Parallel processingfor 119898 in [CE TMalign] do

for 119902 in 119876 dofor 119889 in119863 do

jobsadd(in PE 119899 using method119898 protein pair [119902 119889])end

endendPar calc psc dist(jobs)

Algorithm 1 A pseudo-code of the MCPSC computation implemented in this work The USM jobs are processed sequentially while CEand TMalign pairwise structure comparison jobs are executed in parallel Task par calc psc dist assigns each PSC job to a free PE (node in amany-core and thread in multicore) If n is specified the jobrsquos assignment is fixed otherwise the next free PE available is employed

Table 1 Salient features of the SCC Chip by Intel

Corearchitecture

6 times 4 mesh 2 Pentium P54c (times86) coresper tile

Local cache 256KB L2 Cache 16 KB shared MPB pertile

Main memory 4 iMCs 16ndash64GB total memory

different purposes including processing and communicationA reserved bit in the page table of the cores is used tomark MPBT data Legacy software by default does not markmemory as MPBT and runs without modifications If the bitis set data is cached in L1 and bypasses L2

To benchmark performance we run all-to-all PSC usingdifferent methods on a single and on multiple cores of theSCC Each core is a P54C Intel Pentium (32-bit) running SCCLinux We have also used a PC with a 3GHz AMD AthlonII X2 250 64-bit processor and 3GB RAM running DebianSid to analyze the characteristics of individual PSC methodsand develop PSC job partitioning schemes Lastly amulticorebaseline was established by running all-to-all MCPSC usingthe multithreaded version of MCPSC on a Intel Core i7-4771 ldquoHaswelrdquo Quad-Core CPU running at 35 GHz with8GB of RAM and a SSD running Debian Sid The Core i7CPU features highly optimized out-of-order execution andHT (Hyper Threading) [41] Intels flavor of SimultaneousMultithreading (SMT) All software developed was compiledwith the GNU C Compiler version 48

22 Software Framework for Implementing PSC Methods Weintroduced in [36] a framework for porting a PSC methodto the SCC and have used it to implement a master-slavesvariant of the TMalign PSC method [39] The frameworkallows developing efficient parallel implementations usingbasic functionality provided by our rckskel skeletons libraryfor a many-core processor Algorithmic skeletons allowsequential implementations to be parallelized easily without

specifying architecture dependent details [42] By nestingand combining skeletons the desired level of parallelism canbe introduced into different PSC methods While severalalgorithmic skeleton libraries have been developed [42 43]none of them has been targeted specifically for many-coreprocessor architectures Using rckskel allowed efficient usageof the SCC cores to speed up all-to-all PSC In this work weextend the framework to handle MCPSC while allowing themethods to be combined serially where required

In Algorithm 1 we provide the pseudo-code of the Mul-ticriteria PSC application for the general case of comparingevery element in a set of query proteins119876with every elementin a set of a reference database of proteins 119863 using threePSC methods It covers both the ldquoall-to-allrdquo comparisonwhere the set of query proteins is the same as the setof database proteins and the ldquomany-to-manyrdquo comparisonwhere the query proteins are different from the databaseproteins We used the MCPSC application developed tomeasure completion times of all-to-all PSCThe size of the joblist generated for parallel processing consists of119870119876times119870119863times119872PSC jobs where 119870119876 (119870119863) is the number of proteins in set119876 (119863) respectively and 119872 is the number of PSC methodsimplemented using parallel processing (119872 = 2 currently)This is the most fine grained job distribution that can beobtained without parallelizing the pairwise PSC operation

The rckskel library [36] was upgraded in this work toC++ allowing for more flexibility in usage The choice of theprogramming language was motivated by two requirementsseamless handling of communication data and type safetyIn order to perform parallel operations data must be passedbetween the cores of the SCC In addition to the controlstructures built into the library this includes user data Sincethe user data can vary depending on the user application thelibrary needs to provide support for any data structures tobe passed between the cores This was achieved by using theboost object serializationdeserialization construct whichallows the library to accept any arbitrary data type from theuser as long as it can be serialized By making use of the C++

4 BioMed Research International

template extension the skeletons in rckskel are completelytype safe This implies that any inconsistency in the dataflow described by the user while constructing the skeletonhierarchy is caught at compile time instead of run time whichsignificantly simplifies developing complex parallel programsusing the library

23 MCPSC Application Development for Many-Core Proces-sor In order to perform Multicriteria PSC an applicationwas developed using our software framework incorporatingthree PSC methods For the CE and USM methods weported existing software to the SCC The implementation ofCE builds on existing C++ sources which require a smallstructure manipulation library as an external binary Thislibrary was modified in order to link it to the main CEcode and deprecate the need for the program to run as anexternal binary Further the code was modified to compareonly the first chains of the domains as in TMalign Wethen generated amaster-slaves parallel implementation of theCC++ code using rckskel similar to the master-slaves portwe have developed for TMalign in [36] We also developeda C++ implementation of USM based on the existing Javasources [38] so that we could integrate it with the restof the software As per the existing implementation theKolmogorov complexity of the proteins was approximatedwith the gzip of their contact maps Therefore we takeit that the normalized compression distance between thecontact maps of a pair of proteins expresses their USMsimilarity In the final implementation the master processexecutes the USM jobs sequentially due to the fast processingtimes for these jobs Subsequently the master distributesthe CE and TMalign PSC jobs to the available slaves tobe executed in parallel on different cores Moreover weextended significantly the previous implementation [36] byadding load balancing The software framework supportsboth dynamic round-robin [44] and static jobs partitioningthus allowing experimentation with different methods andcomparison in order to select the best approach for the givenmany-core processor and dataset

24 MCPSC Application Development for Multicore ProcessorIn addition to implementing the MCPSC framework for theSCCwe also developed an implementation formulticore pro-cessors Intercore communication calls handled in the NoCimplementation by rckskel are replaced by reference passingbetween multiple-threads implemented using OpenMP Theresulting software combines the three methods USM CEand TMalign in the same way as the NoC implementationThe USM jobs are therefore processed serially while CE andTMalign are processed in parallel Since the three methodsare combined into one executable the structure data for eachprotein domain is loaded only once and reused whenever thedomain appears in a pairwise comparison being performedThis reduces the overall processing time as compared torunning each method currently available software whereonly pairs are accepted thereby requiring multiple reloads ofeach protein domain to perform all-to-all analysis (119873 + 1times for each domain if the dataset contains 119873 domains)

Further the software allows direct comparison between theperformance of many-core and multicore processors for theMCPSC problem because all other factors algorithms anddata remain constant

25 MCPSC Consensus Score Calculation In both imple-mentations for each protein pair a total of 119872 comparisonscores are returned where119872 is the number of PSC methodsemployed These results are stored in a 119875 times119872matrix where119875 is the total number of protein pairs in a dataset Individualmethod comparison scores may or may not be normalizeddepending on the PSC method used therefore we normalizeall scores in the matrix between 0 and 1 along each column(feature scaling [45]) that is along each PSM method usedThe normalization process proceeds as follows First themin-imum and maximum values for each column of the matrixcontaining the dissimilarity values are determined Thenthe matrix is scaled by dividing the difference of columnelements and the column minimum value by the differenceof the columnmaximum andminimum value as shown in (1)(where 119894 is the row index and 119895 is the column index) Notethat the minimum value of any column cannot be less thanzero because these are PSC scores with a minimum value ofzero Once the columns have been normalized a consensusMCPSC score is calculated for each protein pair by taking theaverage of the 119872 values corresponding to that protein pairIt must be noted that the smaller the pairwise PSC score thehigher the similarity between the pair proteins

1198831015840

119894119895 =

119883119894119895 minusmin (119883119895)

max (119883119895) minusmin (119883119895) (1)

26 Datasets Used Several datasets listed in Table 2 wereused in this work The table includes statistics about thelength of the protein domains and the distribution of theSCOP [46] classifications in each dataset It can be seen thatthere is significant variation in the sizes of the datasets thelengths of protein domains they contain and the distributionof the SCOP classifications For each dataset we retaineddomains where the PDB file could be downloaded and aclassification could be found for the domain in SCOP v175Further the scripts available for download with the USMsources were used to extract the contact maps from the PDBfiles The Chew-Kedem (CK34) and the Rost-Sander (RS119)datasets were used to develop the load balancing schemesfor the SCC and to study the speedup (throughput) on thei7 processor Further all the datasets were used to comparethe performance of the SCC with that of the i7 on all-to-allMCPCS processing

27 F-Measure Analysis for the Chew-Kedem Dataset Weobtained all-against-all comparison scores for all comparisonmethods and based on these scores we performed pairwisehierarchical clustering for each method using hclust fromthe stats package in 119877 [47] We calculated the 119865-measureof the hierarchical clustering for each PSC methods (TMa-lign USM CE and MCPSC) The 119865-measure is calculated

BioMed Research International 5

Table 2 Basic statistics of the domains in the datasets used in this work The table includes the number of SCOP families superfamilies(SpFams) and folds as well as the total number of domains (No) the Minimum (Min) Maximum (Max) Median and Mean and StandardDeviation (Std) of the domain lengths in each dataset

Dataset No Domain lengths SCOP classificationsMin Max Median Mean Std Families SpFams Folds

Skolnick [57] 33 97 255 158 1677 627 5 5 5Chew-Kedem [58] 34 90 497 147 1795 1001 10 9 9Fischer [59] 68 62 581 181 2206 1256 56 44 40Rost-Sander [60] 114 21 753 167 1932 1234 93 85 71Lancia [61] 269 64 72 68 679 24 79 72 57Proteus [62] 277 64 728 239 2476 1167 53 47 41

Table 3 Time required for the baseline all-to-all PSC task using the TMalign CE and USM PSC methods on the PC and a single core of theSCC The table also shows the time required for the all-to-all MCPSC task (where all three PSC methods are used) All times are in seconds

MethodPorted software on PC Ported software on SCC 1 core

CK34 RS119 CK34 RS119Load Processing Load Processing Load Processing Load Processing

TMalign 001 127 005 1725 030 2514 1 33452CE 080 374 3 6459 14 7152 50 132205USM 001 060 004 7 070 30 2 345All 080 502 3 8191 15 9697 53 166000

using (2) where 119862 is the number of clusters (119862 = 5 in thiswork) as described in [48]

True Positives (TP)

=

119862

sum

119894=1

number of correctly assigned domains

False Positives (FP)

=

119862

sum

119894=1

number of incorrectly assigned domains

False Negatives (FN)

=

119862

sum

119894=1

number of missing domains

Precision (P) = TPTP + FP

Recall (R) = TPTP + FN

F-measure (F) = 2 lowast 119875 lowast 119877119875 + 119877

(2)

3 Results and Discussion

31 Baseline All-to-All PSC Table 3 provides the times takento perform all-to-all comparison on the PC and on a singlecore of the SCC with the CK34 and RS119 datasets Thesoftware took longer to load data and process jobs on a singlecore of the SCC as compared to the PC The longer data

loading times are due to the need for network IO since thedata is stored on the Management Console PC (MCPC) andaccessed by the SCC via NFS In long running services how-ever data is loaded once and used when needed thereforethis time is disregarded in our performance comparisonsThedifferences in processing times are due to the architecturedifference of the processor cores (times86-64 versus times86) andtheir operating frequencies (3GHz versus 533MHz) on thePC and a single SCC core respectively

We performed a statistical analysis using SecStAnT [49]of the proteins participating in the 10 slowest and the 10 fastestpairwise PSC tasks (using TMalign and CE) using the CK34and RS119 datasetsThe parameters on which SecStAnT com-pares proteins are their secondary structure characteristicsBond Angle Dihedral Angle Theta and Psi Results of the 1-parameter and 2-parameter statistical analysis distributionsand correlations did not show any significant statistical differ-ence between the slowest and fastest PSC domains Howeverthe slowest pairs contain the Hlx310Alpha supersecondarystructure which is absent from the fastest pairs The 310-helixoccurs rarely in protein domains because the tight packingof the backbone atoms makes it energetically unstable Thisstructure might be responsible for the increased alignmenttimes in certain domains Further we compared the proteinsusing their UniPort entries and observed that most proteinsin the slowest set are Cytoplasmic while most members inthe fastest set are membrane proteins

32 Relation of Expected PSC Times to the Lengths of theProteins underComparison Comparison of the pairwise PSCprocessing with respect to the protein lengths as shown inFigure 2 reveals some interesting facts the time requiredfor completing a pairwise PSC task is clearly a function of

6 BioMed Research International

Table 4 Sum of squares of residuals after curve fitting for the different PSC methods The table shows that the sum of residuals is lower withquadratic fit (square) as compared to linear fit All values are multiplied with 1119890 + 06

PSC methodResiduals (sum of lengths) Residuals (product of lengths)

CK34 RS119 CK34 RS119Linear Square Linear Square Linear Square Linear Square

TMalign 4 1 100 90 1 1 40 30CE 200 100 4000 1000 100 50 2000 800

0123456789

01 02 03 04 05 06 07 08 09 1

Tim

e (se

c)

0123456789

Tim

e (se

c)

Sum of lengths of pair proteins

01 02 03 04 05 06 07 08 09 1

CK34 dataset

CETMalign

CETMalign

CETMalign

CK34 dataset

0

5

10

15

20

25

0 01 02 03 04 05 06 07 08 09 1Ti

me (

sec)

Tim

e (se

c)Sum of lengths of pair proteins

RS119 dataset

CETMalign

00 01 02 03 04 05 06 07 08 09 1Product of lengths of pair proteinsProduct of lengths of pair proteins

RS119 dataset25

20

15

10

5

0

Figure 2 Comparison of pairwise PSCprocessing times usingTMalign andCEwith respect to the normalized sumof lengths andnormalizedproduct of lengths of pair proteins The dotted lines show the quadratic best fit for the data

the lengths of the proteins forming the pair USM requiresextremely small pairwise comparison times as shown inTable 1 and was thus excluded from this analysis The bestfit is achieved when using the product of lengths of the pairproteins as an attribute and is better than the fit achievedwhen using the sum of the lengths Moreover the quadraticcurve fitting results are better than the linear as shown inTable 4 For a given PSC method the relative times requiredfor completing a pairwise PSC task depend on the proper-ties of the pairs participating However the absolute timerequired for completing a pairwise PSC task depends also onthe complexity of the method used Since the complexity ofone pairwise PSC method as compared to another cannot be

known a priori it cannot be used as a factor for a generic jobspartitioning scheme for load balancing purposes

33 Speedup and Efficiency Analysis of Multiple Criteria PSCon the SCC NoC CPU In this work we experimented withboth round-robin and job partitioning schemes for loadbalancing Given the close correlation between lengths of pairproteins and processing timeswe used this as criteria for bothRound-robin [44] is a dynamic load balancing method inwhich the master process maintains a list of jobs and handsthe next job in the list to the next free slave process It isworth noting that if the list is presorted based on attributesthat are known to be proportional to the processing times this

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

2 BioMed Research International

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

TileR

System IF

Core 1

Core 0

Router MPB

Tile

Mem

ory

cont

rolle

r

Mem

ory

cont

rolle

r

Mem

ory

cont

rolle

r

Mem

ory

cont

rolle

rL2$0

L2$1

Figure 1 The Intel Single-Chip Cloud Computer It is a network on chip processor architecture having a mesh of 4 by 6 = 24 routers (R)connecting the 24 ldquotilesrdquo with 2 cores per tile [28]

These parallel processing architectures have become morereadily available [21 22] and instances of their use arebeginning to appear in the broader field of biocomputing[23 24] These processor architectures can in principle beused additively to meet the ever increasing computationaldemands of MCPSC by complementing already in use dis-tributed computing approaches [25]

Multicore and many-core CPUs as opposed to GPUsretain backward compatibility to well-established program-ming models [26] and offer the key advantage of usingprogramming methods languages and tools familiar tomost programmers Many-core processors differ from theirmulticore counterparts primarily on the communicationsubsystem Many-core CPUs use a Network-on-Chip (NoC)while multicore CPUs use bus-based structures [27] TheSingle-Chip Cloud Computer (SCC) experimental processor[28] is a 48-core NoC-based concept-vehicle created by IntelLabs in 2009 as a platform for many-core software researchWhile multicore CPUs are ubiquitous and easily availablefor parallel processing due to the drive from leading chipmanufacturers over the past several years [29] many-coreCPUs are not as deeply entrenched or commonly availableyet However due to architectural improvements many-coreprocessors have the potential to deliver scalable performanceby exploiting a larger number of cores at a lower costof intercore communication [30 31] Moreover many-coreprocessors offer improved power efficiency (performance perwatt) due to the use of more ldquolight-weightrdquo cores [26 32] Allthese factors makemodern CPUs (multicore andmany-core)powerful technologies suitable for meeting the high com-putational demands of MCPSC computations by employingscalable parallel processing Both multicore and many-coreprocessors have been used in bioinformatics [33] mainly forpairwise or multiple sequence comparison [34 35] HoweverNoC architectures have not yet been extensively employeddespite the flexibility and parallel processing capabilities theyoffer [35]

In this work we extend the framework presented in [36]for MCPSC We augment the software with implementa-tions of two popularly used PSC methodsmdashCombinatorialExtension (CE) [37] and Universal Similarity Metric (USM)[38]mdashin addition to TMalign [39] and then combine them

to create a scalable MCPSC application for the SCC We alsodevelop an equivalent MCPSC software implementation fora modern multicore CPU We combine sequential processing(for USMPSC tasks) and distributed processing (for TMalignand CE PSC tasks) to obtain high speedup on a singlechip Our solution highlights the flexibility offered by thesearchitectures in combining the parallel processing elementsbased on the requirements of the problem We analyze thecharacteristics of the MCPSC software on the SCC with aview of achieving high speedup (efficiency) and comparethe performance to that of the multicore implementationContrasting the performance of multicore and many-coreprocessors using multiple datasets of different sizes high-lights the advantages and disadvantages of each architectureas applicable to MCPSC To the best of our knowledgethis is the first attempt in the literature to develop andcompare MCPSC implementations for multicore and many-core CPUs

2 Methods

21 Experimental Systems Used The Single-Chip CloudComputer (SCC) is an experimental Intel microprocessorarchitecture containing 48 cores integrated on a silicon CPUchip It has multiple dual times86 core ldquotilesrdquo arranged in a 4 times6 mesh memory controllers and a 24-router mesh networkas depicted in Figure 1 Although it is a ldquoconcept-vehiclerdquo[40] the technology it uses is scalable to more than 100cores on a single chip [28] The SCC resembles a cluster ofcomputer nodes capable of communicating with each otherin much the same way as a cluster of independent machinesSalient features of the SCC hardware architecture relevant toprogramming the chip are listed in Table 1

Further each core of the SCC has L1 and L2 caches Inthe SCC architecture the L1 caches (16 KB each) are on thecore while the L2 caches (256KB each) are on the tile nextto the core with each tile carrying 2 cores Further each tilealso has a small message passing buffer (MPB) of 16 KBand is shared among all the cores on the chip Hence with24 tiles the SCC provides a Message Passing Buffer Type(MPBT) memory of size 384KB The SCC therefore providesa hierarchy of memories usable by application programs for

BioMed Research International 3

Data119876 query protein structures119863 database of known protein structures Sequential processingcalc usm dist(119876119863) Parallel processingfor 119898 in [CE TMalign] do

for 119902 in 119876 dofor 119889 in119863 do

jobsadd(in PE 119899 using method119898 protein pair [119902 119889])end

endendPar calc psc dist(jobs)

Algorithm 1 A pseudo-code of the MCPSC computation implemented in this work The USM jobs are processed sequentially while CEand TMalign pairwise structure comparison jobs are executed in parallel Task par calc psc dist assigns each PSC job to a free PE (node in amany-core and thread in multicore) If n is specified the jobrsquos assignment is fixed otherwise the next free PE available is employed

Table 1 Salient features of the SCC Chip by Intel

Corearchitecture

6 times 4 mesh 2 Pentium P54c (times86) coresper tile

Local cache 256KB L2 Cache 16 KB shared MPB pertile

Main memory 4 iMCs 16ndash64GB total memory

different purposes including processing and communicationA reserved bit in the page table of the cores is used tomark MPBT data Legacy software by default does not markmemory as MPBT and runs without modifications If the bitis set data is cached in L1 and bypasses L2

To benchmark performance we run all-to-all PSC usingdifferent methods on a single and on multiple cores of theSCC Each core is a P54C Intel Pentium (32-bit) running SCCLinux We have also used a PC with a 3GHz AMD AthlonII X2 250 64-bit processor and 3GB RAM running DebianSid to analyze the characteristics of individual PSC methodsand develop PSC job partitioning schemes Lastly amulticorebaseline was established by running all-to-all MCPSC usingthe multithreaded version of MCPSC on a Intel Core i7-4771 ldquoHaswelrdquo Quad-Core CPU running at 35 GHz with8GB of RAM and a SSD running Debian Sid The Core i7CPU features highly optimized out-of-order execution andHT (Hyper Threading) [41] Intels flavor of SimultaneousMultithreading (SMT) All software developed was compiledwith the GNU C Compiler version 48

22 Software Framework for Implementing PSC Methods Weintroduced in [36] a framework for porting a PSC methodto the SCC and have used it to implement a master-slavesvariant of the TMalign PSC method [39] The frameworkallows developing efficient parallel implementations usingbasic functionality provided by our rckskel skeletons libraryfor a many-core processor Algorithmic skeletons allowsequential implementations to be parallelized easily without

specifying architecture dependent details [42] By nestingand combining skeletons the desired level of parallelism canbe introduced into different PSC methods While severalalgorithmic skeleton libraries have been developed [42 43]none of them has been targeted specifically for many-coreprocessor architectures Using rckskel allowed efficient usageof the SCC cores to speed up all-to-all PSC In this work weextend the framework to handle MCPSC while allowing themethods to be combined serially where required

In Algorithm 1 we provide the pseudo-code of the Mul-ticriteria PSC application for the general case of comparingevery element in a set of query proteins119876with every elementin a set of a reference database of proteins 119863 using threePSC methods It covers both the ldquoall-to-allrdquo comparisonwhere the set of query proteins is the same as the setof database proteins and the ldquomany-to-manyrdquo comparisonwhere the query proteins are different from the databaseproteins We used the MCPSC application developed tomeasure completion times of all-to-all PSCThe size of the joblist generated for parallel processing consists of119870119876times119870119863times119872PSC jobs where 119870119876 (119870119863) is the number of proteins in set119876 (119863) respectively and 119872 is the number of PSC methodsimplemented using parallel processing (119872 = 2 currently)This is the most fine grained job distribution that can beobtained without parallelizing the pairwise PSC operation

The rckskel library [36] was upgraded in this work toC++ allowing for more flexibility in usage The choice of theprogramming language was motivated by two requirementsseamless handling of communication data and type safetyIn order to perform parallel operations data must be passedbetween the cores of the SCC In addition to the controlstructures built into the library this includes user data Sincethe user data can vary depending on the user application thelibrary needs to provide support for any data structures tobe passed between the cores This was achieved by using theboost object serializationdeserialization construct whichallows the library to accept any arbitrary data type from theuser as long as it can be serialized By making use of the C++

4 BioMed Research International

template extension the skeletons in rckskel are completelytype safe This implies that any inconsistency in the dataflow described by the user while constructing the skeletonhierarchy is caught at compile time instead of run time whichsignificantly simplifies developing complex parallel programsusing the library

23 MCPSC Application Development for Many-Core Proces-sor In order to perform Multicriteria PSC an applicationwas developed using our software framework incorporatingthree PSC methods For the CE and USM methods weported existing software to the SCC The implementation ofCE builds on existing C++ sources which require a smallstructure manipulation library as an external binary Thislibrary was modified in order to link it to the main CEcode and deprecate the need for the program to run as anexternal binary Further the code was modified to compareonly the first chains of the domains as in TMalign Wethen generated amaster-slaves parallel implementation of theCC++ code using rckskel similar to the master-slaves portwe have developed for TMalign in [36] We also developeda C++ implementation of USM based on the existing Javasources [38] so that we could integrate it with the restof the software As per the existing implementation theKolmogorov complexity of the proteins was approximatedwith the gzip of their contact maps Therefore we takeit that the normalized compression distance between thecontact maps of a pair of proteins expresses their USMsimilarity In the final implementation the master processexecutes the USM jobs sequentially due to the fast processingtimes for these jobs Subsequently the master distributesthe CE and TMalign PSC jobs to the available slaves tobe executed in parallel on different cores Moreover weextended significantly the previous implementation [36] byadding load balancing The software framework supportsboth dynamic round-robin [44] and static jobs partitioningthus allowing experimentation with different methods andcomparison in order to select the best approach for the givenmany-core processor and dataset

24 MCPSC Application Development for Multicore ProcessorIn addition to implementing the MCPSC framework for theSCCwe also developed an implementation formulticore pro-cessors Intercore communication calls handled in the NoCimplementation by rckskel are replaced by reference passingbetween multiple-threads implemented using OpenMP Theresulting software combines the three methods USM CEand TMalign in the same way as the NoC implementationThe USM jobs are therefore processed serially while CE andTMalign are processed in parallel Since the three methodsare combined into one executable the structure data for eachprotein domain is loaded only once and reused whenever thedomain appears in a pairwise comparison being performedThis reduces the overall processing time as compared torunning each method currently available software whereonly pairs are accepted thereby requiring multiple reloads ofeach protein domain to perform all-to-all analysis (119873 + 1times for each domain if the dataset contains 119873 domains)

Further the software allows direct comparison between theperformance of many-core and multicore processors for theMCPSC problem because all other factors algorithms anddata remain constant

25 MCPSC Consensus Score Calculation In both imple-mentations for each protein pair a total of 119872 comparisonscores are returned where119872 is the number of PSC methodsemployed These results are stored in a 119875 times119872matrix where119875 is the total number of protein pairs in a dataset Individualmethod comparison scores may or may not be normalizeddepending on the PSC method used therefore we normalizeall scores in the matrix between 0 and 1 along each column(feature scaling [45]) that is along each PSM method usedThe normalization process proceeds as follows First themin-imum and maximum values for each column of the matrixcontaining the dissimilarity values are determined Thenthe matrix is scaled by dividing the difference of columnelements and the column minimum value by the differenceof the columnmaximum andminimum value as shown in (1)(where 119894 is the row index and 119895 is the column index) Notethat the minimum value of any column cannot be less thanzero because these are PSC scores with a minimum value ofzero Once the columns have been normalized a consensusMCPSC score is calculated for each protein pair by taking theaverage of the 119872 values corresponding to that protein pairIt must be noted that the smaller the pairwise PSC score thehigher the similarity between the pair proteins

1198831015840

119894119895 =

119883119894119895 minusmin (119883119895)

max (119883119895) minusmin (119883119895) (1)

26 Datasets Used Several datasets listed in Table 2 wereused in this work The table includes statistics about thelength of the protein domains and the distribution of theSCOP [46] classifications in each dataset It can be seen thatthere is significant variation in the sizes of the datasets thelengths of protein domains they contain and the distributionof the SCOP classifications For each dataset we retaineddomains where the PDB file could be downloaded and aclassification could be found for the domain in SCOP v175Further the scripts available for download with the USMsources were used to extract the contact maps from the PDBfiles The Chew-Kedem (CK34) and the Rost-Sander (RS119)datasets were used to develop the load balancing schemesfor the SCC and to study the speedup (throughput) on thei7 processor Further all the datasets were used to comparethe performance of the SCC with that of the i7 on all-to-allMCPCS processing

27 F-Measure Analysis for the Chew-Kedem Dataset Weobtained all-against-all comparison scores for all comparisonmethods and based on these scores we performed pairwisehierarchical clustering for each method using hclust fromthe stats package in 119877 [47] We calculated the 119865-measureof the hierarchical clustering for each PSC methods (TMa-lign USM CE and MCPSC) The 119865-measure is calculated

BioMed Research International 5

Table 2 Basic statistics of the domains in the datasets used in this work The table includes the number of SCOP families superfamilies(SpFams) and folds as well as the total number of domains (No) the Minimum (Min) Maximum (Max) Median and Mean and StandardDeviation (Std) of the domain lengths in each dataset

Dataset No Domain lengths SCOP classificationsMin Max Median Mean Std Families SpFams Folds

Skolnick [57] 33 97 255 158 1677 627 5 5 5Chew-Kedem [58] 34 90 497 147 1795 1001 10 9 9Fischer [59] 68 62 581 181 2206 1256 56 44 40Rost-Sander [60] 114 21 753 167 1932 1234 93 85 71Lancia [61] 269 64 72 68 679 24 79 72 57Proteus [62] 277 64 728 239 2476 1167 53 47 41

Table 3 Time required for the baseline all-to-all PSC task using the TMalign CE and USM PSC methods on the PC and a single core of theSCC The table also shows the time required for the all-to-all MCPSC task (where all three PSC methods are used) All times are in seconds

MethodPorted software on PC Ported software on SCC 1 core

CK34 RS119 CK34 RS119Load Processing Load Processing Load Processing Load Processing

TMalign 001 127 005 1725 030 2514 1 33452CE 080 374 3 6459 14 7152 50 132205USM 001 060 004 7 070 30 2 345All 080 502 3 8191 15 9697 53 166000

using (2) where 119862 is the number of clusters (119862 = 5 in thiswork) as described in [48]

True Positives (TP)

=

119862

sum

119894=1

number of correctly assigned domains

False Positives (FP)

=

119862

sum

119894=1

number of incorrectly assigned domains

False Negatives (FN)

=

119862

sum

119894=1

number of missing domains

Precision (P) = TPTP + FP

Recall (R) = TPTP + FN

F-measure (F) = 2 lowast 119875 lowast 119877119875 + 119877

(2)

3 Results and Discussion

31 Baseline All-to-All PSC Table 3 provides the times takento perform all-to-all comparison on the PC and on a singlecore of the SCC with the CK34 and RS119 datasets Thesoftware took longer to load data and process jobs on a singlecore of the SCC as compared to the PC The longer data

loading times are due to the need for network IO since thedata is stored on the Management Console PC (MCPC) andaccessed by the SCC via NFS In long running services how-ever data is loaded once and used when needed thereforethis time is disregarded in our performance comparisonsThedifferences in processing times are due to the architecturedifference of the processor cores (times86-64 versus times86) andtheir operating frequencies (3GHz versus 533MHz) on thePC and a single SCC core respectively

We performed a statistical analysis using SecStAnT [49]of the proteins participating in the 10 slowest and the 10 fastestpairwise PSC tasks (using TMalign and CE) using the CK34and RS119 datasetsThe parameters on which SecStAnT com-pares proteins are their secondary structure characteristicsBond Angle Dihedral Angle Theta and Psi Results of the 1-parameter and 2-parameter statistical analysis distributionsand correlations did not show any significant statistical differ-ence between the slowest and fastest PSC domains Howeverthe slowest pairs contain the Hlx310Alpha supersecondarystructure which is absent from the fastest pairs The 310-helixoccurs rarely in protein domains because the tight packingof the backbone atoms makes it energetically unstable Thisstructure might be responsible for the increased alignmenttimes in certain domains Further we compared the proteinsusing their UniPort entries and observed that most proteinsin the slowest set are Cytoplasmic while most members inthe fastest set are membrane proteins

32 Relation of Expected PSC Times to the Lengths of theProteins underComparison Comparison of the pairwise PSCprocessing with respect to the protein lengths as shown inFigure 2 reveals some interesting facts the time requiredfor completing a pairwise PSC task is clearly a function of

6 BioMed Research International

Table 4 Sum of squares of residuals after curve fitting for the different PSC methods The table shows that the sum of residuals is lower withquadratic fit (square) as compared to linear fit All values are multiplied with 1119890 + 06

PSC methodResiduals (sum of lengths) Residuals (product of lengths)

CK34 RS119 CK34 RS119Linear Square Linear Square Linear Square Linear Square

TMalign 4 1 100 90 1 1 40 30CE 200 100 4000 1000 100 50 2000 800

0123456789

01 02 03 04 05 06 07 08 09 1

Tim

e (se

c)

0123456789

Tim

e (se

c)

Sum of lengths of pair proteins

01 02 03 04 05 06 07 08 09 1

CK34 dataset

CETMalign

CETMalign

CETMalign

CK34 dataset

0

5

10

15

20

25

0 01 02 03 04 05 06 07 08 09 1Ti

me (

sec)

Tim

e (se

c)Sum of lengths of pair proteins

RS119 dataset

CETMalign

00 01 02 03 04 05 06 07 08 09 1Product of lengths of pair proteinsProduct of lengths of pair proteins

RS119 dataset25

20

15

10

5

0

Figure 2 Comparison of pairwise PSCprocessing times usingTMalign andCEwith respect to the normalized sumof lengths andnormalizedproduct of lengths of pair proteins The dotted lines show the quadratic best fit for the data

the lengths of the proteins forming the pair USM requiresextremely small pairwise comparison times as shown inTable 1 and was thus excluded from this analysis The bestfit is achieved when using the product of lengths of the pairproteins as an attribute and is better than the fit achievedwhen using the sum of the lengths Moreover the quadraticcurve fitting results are better than the linear as shown inTable 4 For a given PSC method the relative times requiredfor completing a pairwise PSC task depend on the proper-ties of the pairs participating However the absolute timerequired for completing a pairwise PSC task depends also onthe complexity of the method used Since the complexity ofone pairwise PSC method as compared to another cannot be

known a priori it cannot be used as a factor for a generic jobspartitioning scheme for load balancing purposes

33 Speedup and Efficiency Analysis of Multiple Criteria PSCon the SCC NoC CPU In this work we experimented withboth round-robin and job partitioning schemes for loadbalancing Given the close correlation between lengths of pairproteins and processing timeswe used this as criteria for bothRound-robin [44] is a dynamic load balancing method inwhich the master process maintains a list of jobs and handsthe next job in the list to the next free slave process It isworth noting that if the list is presorted based on attributesthat are known to be proportional to the processing times this

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

BioMed Research International 3

Data119876 query protein structures119863 database of known protein structures Sequential processingcalc usm dist(119876119863) Parallel processingfor 119898 in [CE TMalign] do

for 119902 in 119876 dofor 119889 in119863 do

jobsadd(in PE 119899 using method119898 protein pair [119902 119889])end

endendPar calc psc dist(jobs)

Algorithm 1 A pseudo-code of the MCPSC computation implemented in this work The USM jobs are processed sequentially while CEand TMalign pairwise structure comparison jobs are executed in parallel Task par calc psc dist assigns each PSC job to a free PE (node in amany-core and thread in multicore) If n is specified the jobrsquos assignment is fixed otherwise the next free PE available is employed

Table 1 Salient features of the SCC Chip by Intel

Corearchitecture

6 times 4 mesh 2 Pentium P54c (times86) coresper tile

Local cache 256KB L2 Cache 16 KB shared MPB pertile

Main memory 4 iMCs 16ndash64GB total memory

different purposes including processing and communicationA reserved bit in the page table of the cores is used tomark MPBT data Legacy software by default does not markmemory as MPBT and runs without modifications If the bitis set data is cached in L1 and bypasses L2

To benchmark performance we run all-to-all PSC usingdifferent methods on a single and on multiple cores of theSCC Each core is a P54C Intel Pentium (32-bit) running SCCLinux We have also used a PC with a 3GHz AMD AthlonII X2 250 64-bit processor and 3GB RAM running DebianSid to analyze the characteristics of individual PSC methodsand develop PSC job partitioning schemes Lastly amulticorebaseline was established by running all-to-all MCPSC usingthe multithreaded version of MCPSC on a Intel Core i7-4771 ldquoHaswelrdquo Quad-Core CPU running at 35 GHz with8GB of RAM and a SSD running Debian Sid The Core i7CPU features highly optimized out-of-order execution andHT (Hyper Threading) [41] Intels flavor of SimultaneousMultithreading (SMT) All software developed was compiledwith the GNU C Compiler version 48

22 Software Framework for Implementing PSC Methods Weintroduced in [36] a framework for porting a PSC methodto the SCC and have used it to implement a master-slavesvariant of the TMalign PSC method [39] The frameworkallows developing efficient parallel implementations usingbasic functionality provided by our rckskel skeletons libraryfor a many-core processor Algorithmic skeletons allowsequential implementations to be parallelized easily without

specifying architecture dependent details [42] By nestingand combining skeletons the desired level of parallelism canbe introduced into different PSC methods While severalalgorithmic skeleton libraries have been developed [42 43]none of them has been targeted specifically for many-coreprocessor architectures Using rckskel allowed efficient usageof the SCC cores to speed up all-to-all PSC In this work weextend the framework to handle MCPSC while allowing themethods to be combined serially where required

In Algorithm 1 we provide the pseudo-code of the Mul-ticriteria PSC application for the general case of comparingevery element in a set of query proteins119876with every elementin a set of a reference database of proteins 119863 using threePSC methods It covers both the ldquoall-to-allrdquo comparisonwhere the set of query proteins is the same as the setof database proteins and the ldquomany-to-manyrdquo comparisonwhere the query proteins are different from the databaseproteins We used the MCPSC application developed tomeasure completion times of all-to-all PSCThe size of the joblist generated for parallel processing consists of119870119876times119870119863times119872PSC jobs where 119870119876 (119870119863) is the number of proteins in set119876 (119863) respectively and 119872 is the number of PSC methodsimplemented using parallel processing (119872 = 2 currently)This is the most fine grained job distribution that can beobtained without parallelizing the pairwise PSC operation

The rckskel library [36] was upgraded in this work toC++ allowing for more flexibility in usage The choice of theprogramming language was motivated by two requirementsseamless handling of communication data and type safetyIn order to perform parallel operations data must be passedbetween the cores of the SCC In addition to the controlstructures built into the library this includes user data Sincethe user data can vary depending on the user application thelibrary needs to provide support for any data structures tobe passed between the cores This was achieved by using theboost object serializationdeserialization construct whichallows the library to accept any arbitrary data type from theuser as long as it can be serialized By making use of the C++

4 BioMed Research International

template extension the skeletons in rckskel are completelytype safe This implies that any inconsistency in the dataflow described by the user while constructing the skeletonhierarchy is caught at compile time instead of run time whichsignificantly simplifies developing complex parallel programsusing the library

23 MCPSC Application Development for Many-Core Proces-sor In order to perform Multicriteria PSC an applicationwas developed using our software framework incorporatingthree PSC methods For the CE and USM methods weported existing software to the SCC The implementation ofCE builds on existing C++ sources which require a smallstructure manipulation library as an external binary Thislibrary was modified in order to link it to the main CEcode and deprecate the need for the program to run as anexternal binary Further the code was modified to compareonly the first chains of the domains as in TMalign Wethen generated amaster-slaves parallel implementation of theCC++ code using rckskel similar to the master-slaves portwe have developed for TMalign in [36] We also developeda C++ implementation of USM based on the existing Javasources [38] so that we could integrate it with the restof the software As per the existing implementation theKolmogorov complexity of the proteins was approximatedwith the gzip of their contact maps Therefore we takeit that the normalized compression distance between thecontact maps of a pair of proteins expresses their USMsimilarity In the final implementation the master processexecutes the USM jobs sequentially due to the fast processingtimes for these jobs Subsequently the master distributesthe CE and TMalign PSC jobs to the available slaves tobe executed in parallel on different cores Moreover weextended significantly the previous implementation [36] byadding load balancing The software framework supportsboth dynamic round-robin [44] and static jobs partitioningthus allowing experimentation with different methods andcomparison in order to select the best approach for the givenmany-core processor and dataset

24 MCPSC Application Development for Multicore ProcessorIn addition to implementing the MCPSC framework for theSCCwe also developed an implementation formulticore pro-cessors Intercore communication calls handled in the NoCimplementation by rckskel are replaced by reference passingbetween multiple-threads implemented using OpenMP Theresulting software combines the three methods USM CEand TMalign in the same way as the NoC implementationThe USM jobs are therefore processed serially while CE andTMalign are processed in parallel Since the three methodsare combined into one executable the structure data for eachprotein domain is loaded only once and reused whenever thedomain appears in a pairwise comparison being performedThis reduces the overall processing time as compared torunning each method currently available software whereonly pairs are accepted thereby requiring multiple reloads ofeach protein domain to perform all-to-all analysis (119873 + 1times for each domain if the dataset contains 119873 domains)

Further the software allows direct comparison between theperformance of many-core and multicore processors for theMCPSC problem because all other factors algorithms anddata remain constant

25 MCPSC Consensus Score Calculation In both imple-mentations for each protein pair a total of 119872 comparisonscores are returned where119872 is the number of PSC methodsemployed These results are stored in a 119875 times119872matrix where119875 is the total number of protein pairs in a dataset Individualmethod comparison scores may or may not be normalizeddepending on the PSC method used therefore we normalizeall scores in the matrix between 0 and 1 along each column(feature scaling [45]) that is along each PSM method usedThe normalization process proceeds as follows First themin-imum and maximum values for each column of the matrixcontaining the dissimilarity values are determined Thenthe matrix is scaled by dividing the difference of columnelements and the column minimum value by the differenceof the columnmaximum andminimum value as shown in (1)(where 119894 is the row index and 119895 is the column index) Notethat the minimum value of any column cannot be less thanzero because these are PSC scores with a minimum value ofzero Once the columns have been normalized a consensusMCPSC score is calculated for each protein pair by taking theaverage of the 119872 values corresponding to that protein pairIt must be noted that the smaller the pairwise PSC score thehigher the similarity between the pair proteins

1198831015840

119894119895 =

119883119894119895 minusmin (119883119895)

max (119883119895) minusmin (119883119895) (1)

26 Datasets Used Several datasets listed in Table 2 wereused in this work The table includes statistics about thelength of the protein domains and the distribution of theSCOP [46] classifications in each dataset It can be seen thatthere is significant variation in the sizes of the datasets thelengths of protein domains they contain and the distributionof the SCOP classifications For each dataset we retaineddomains where the PDB file could be downloaded and aclassification could be found for the domain in SCOP v175Further the scripts available for download with the USMsources were used to extract the contact maps from the PDBfiles The Chew-Kedem (CK34) and the Rost-Sander (RS119)datasets were used to develop the load balancing schemesfor the SCC and to study the speedup (throughput) on thei7 processor Further all the datasets were used to comparethe performance of the SCC with that of the i7 on all-to-allMCPCS processing

27 F-Measure Analysis for the Chew-Kedem Dataset Weobtained all-against-all comparison scores for all comparisonmethods and based on these scores we performed pairwisehierarchical clustering for each method using hclust fromthe stats package in 119877 [47] We calculated the 119865-measureof the hierarchical clustering for each PSC methods (TMa-lign USM CE and MCPSC) The 119865-measure is calculated

BioMed Research International 5

Table 2 Basic statistics of the domains in the datasets used in this work The table includes the number of SCOP families superfamilies(SpFams) and folds as well as the total number of domains (No) the Minimum (Min) Maximum (Max) Median and Mean and StandardDeviation (Std) of the domain lengths in each dataset

Dataset No Domain lengths SCOP classificationsMin Max Median Mean Std Families SpFams Folds

Skolnick [57] 33 97 255 158 1677 627 5 5 5Chew-Kedem [58] 34 90 497 147 1795 1001 10 9 9Fischer [59] 68 62 581 181 2206 1256 56 44 40Rost-Sander [60] 114 21 753 167 1932 1234 93 85 71Lancia [61] 269 64 72 68 679 24 79 72 57Proteus [62] 277 64 728 239 2476 1167 53 47 41

Table 3 Time required for the baseline all-to-all PSC task using the TMalign CE and USM PSC methods on the PC and a single core of theSCC The table also shows the time required for the all-to-all MCPSC task (where all three PSC methods are used) All times are in seconds

MethodPorted software on PC Ported software on SCC 1 core

CK34 RS119 CK34 RS119Load Processing Load Processing Load Processing Load Processing

TMalign 001 127 005 1725 030 2514 1 33452CE 080 374 3 6459 14 7152 50 132205USM 001 060 004 7 070 30 2 345All 080 502 3 8191 15 9697 53 166000

using (2) where 119862 is the number of clusters (119862 = 5 in thiswork) as described in [48]

True Positives (TP)

=

119862

sum

119894=1

number of correctly assigned domains

False Positives (FP)

=

119862

sum

119894=1

number of incorrectly assigned domains

False Negatives (FN)

=

119862

sum

119894=1

number of missing domains

Precision (P) = TPTP + FP

Recall (R) = TPTP + FN

F-measure (F) = 2 lowast 119875 lowast 119877119875 + 119877

(2)

3 Results and Discussion

31 Baseline All-to-All PSC Table 3 provides the times takento perform all-to-all comparison on the PC and on a singlecore of the SCC with the CK34 and RS119 datasets Thesoftware took longer to load data and process jobs on a singlecore of the SCC as compared to the PC The longer data

loading times are due to the need for network IO since thedata is stored on the Management Console PC (MCPC) andaccessed by the SCC via NFS In long running services how-ever data is loaded once and used when needed thereforethis time is disregarded in our performance comparisonsThedifferences in processing times are due to the architecturedifference of the processor cores (times86-64 versus times86) andtheir operating frequencies (3GHz versus 533MHz) on thePC and a single SCC core respectively

We performed a statistical analysis using SecStAnT [49]of the proteins participating in the 10 slowest and the 10 fastestpairwise PSC tasks (using TMalign and CE) using the CK34and RS119 datasetsThe parameters on which SecStAnT com-pares proteins are their secondary structure characteristicsBond Angle Dihedral Angle Theta and Psi Results of the 1-parameter and 2-parameter statistical analysis distributionsand correlations did not show any significant statistical differ-ence between the slowest and fastest PSC domains Howeverthe slowest pairs contain the Hlx310Alpha supersecondarystructure which is absent from the fastest pairs The 310-helixoccurs rarely in protein domains because the tight packingof the backbone atoms makes it energetically unstable Thisstructure might be responsible for the increased alignmenttimes in certain domains Further we compared the proteinsusing their UniPort entries and observed that most proteinsin the slowest set are Cytoplasmic while most members inthe fastest set are membrane proteins

32 Relation of Expected PSC Times to the Lengths of theProteins underComparison Comparison of the pairwise PSCprocessing with respect to the protein lengths as shown inFigure 2 reveals some interesting facts the time requiredfor completing a pairwise PSC task is clearly a function of

6 BioMed Research International

Table 4 Sum of squares of residuals after curve fitting for the different PSC methods The table shows that the sum of residuals is lower withquadratic fit (square) as compared to linear fit All values are multiplied with 1119890 + 06

PSC methodResiduals (sum of lengths) Residuals (product of lengths)

CK34 RS119 CK34 RS119Linear Square Linear Square Linear Square Linear Square

TMalign 4 1 100 90 1 1 40 30CE 200 100 4000 1000 100 50 2000 800

0123456789

01 02 03 04 05 06 07 08 09 1

Tim

e (se

c)

0123456789

Tim

e (se

c)

Sum of lengths of pair proteins

01 02 03 04 05 06 07 08 09 1

CK34 dataset

CETMalign

CETMalign

CETMalign

CK34 dataset

0

5

10

15

20

25

0 01 02 03 04 05 06 07 08 09 1Ti

me (

sec)

Tim

e (se

c)Sum of lengths of pair proteins

RS119 dataset

CETMalign

00 01 02 03 04 05 06 07 08 09 1Product of lengths of pair proteinsProduct of lengths of pair proteins

RS119 dataset25

20

15

10

5

0

Figure 2 Comparison of pairwise PSCprocessing times usingTMalign andCEwith respect to the normalized sumof lengths andnormalizedproduct of lengths of pair proteins The dotted lines show the quadratic best fit for the data

the lengths of the proteins forming the pair USM requiresextremely small pairwise comparison times as shown inTable 1 and was thus excluded from this analysis The bestfit is achieved when using the product of lengths of the pairproteins as an attribute and is better than the fit achievedwhen using the sum of the lengths Moreover the quadraticcurve fitting results are better than the linear as shown inTable 4 For a given PSC method the relative times requiredfor completing a pairwise PSC task depend on the proper-ties of the pairs participating However the absolute timerequired for completing a pairwise PSC task depends also onthe complexity of the method used Since the complexity ofone pairwise PSC method as compared to another cannot be

known a priori it cannot be used as a factor for a generic jobspartitioning scheme for load balancing purposes

33 Speedup and Efficiency Analysis of Multiple Criteria PSCon the SCC NoC CPU In this work we experimented withboth round-robin and job partitioning schemes for loadbalancing Given the close correlation between lengths of pairproteins and processing timeswe used this as criteria for bothRound-robin [44] is a dynamic load balancing method inwhich the master process maintains a list of jobs and handsthe next job in the list to the next free slave process It isworth noting that if the list is presorted based on attributesthat are known to be proportional to the processing times this

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

4 BioMed Research International

template extension the skeletons in rckskel are completelytype safe This implies that any inconsistency in the dataflow described by the user while constructing the skeletonhierarchy is caught at compile time instead of run time whichsignificantly simplifies developing complex parallel programsusing the library

23 MCPSC Application Development for Many-Core Proces-sor In order to perform Multicriteria PSC an applicationwas developed using our software framework incorporatingthree PSC methods For the CE and USM methods weported existing software to the SCC The implementation ofCE builds on existing C++ sources which require a smallstructure manipulation library as an external binary Thislibrary was modified in order to link it to the main CEcode and deprecate the need for the program to run as anexternal binary Further the code was modified to compareonly the first chains of the domains as in TMalign Wethen generated amaster-slaves parallel implementation of theCC++ code using rckskel similar to the master-slaves portwe have developed for TMalign in [36] We also developeda C++ implementation of USM based on the existing Javasources [38] so that we could integrate it with the restof the software As per the existing implementation theKolmogorov complexity of the proteins was approximatedwith the gzip of their contact maps Therefore we takeit that the normalized compression distance between thecontact maps of a pair of proteins expresses their USMsimilarity In the final implementation the master processexecutes the USM jobs sequentially due to the fast processingtimes for these jobs Subsequently the master distributesthe CE and TMalign PSC jobs to the available slaves tobe executed in parallel on different cores Moreover weextended significantly the previous implementation [36] byadding load balancing The software framework supportsboth dynamic round-robin [44] and static jobs partitioningthus allowing experimentation with different methods andcomparison in order to select the best approach for the givenmany-core processor and dataset

24 MCPSC Application Development for Multicore ProcessorIn addition to implementing the MCPSC framework for theSCCwe also developed an implementation formulticore pro-cessors Intercore communication calls handled in the NoCimplementation by rckskel are replaced by reference passingbetween multiple-threads implemented using OpenMP Theresulting software combines the three methods USM CEand TMalign in the same way as the NoC implementationThe USM jobs are therefore processed serially while CE andTMalign are processed in parallel Since the three methodsare combined into one executable the structure data for eachprotein domain is loaded only once and reused whenever thedomain appears in a pairwise comparison being performedThis reduces the overall processing time as compared torunning each method currently available software whereonly pairs are accepted thereby requiring multiple reloads ofeach protein domain to perform all-to-all analysis (119873 + 1times for each domain if the dataset contains 119873 domains)

Further the software allows direct comparison between theperformance of many-core and multicore processors for theMCPSC problem because all other factors algorithms anddata remain constant

25 MCPSC Consensus Score Calculation In both imple-mentations for each protein pair a total of 119872 comparisonscores are returned where119872 is the number of PSC methodsemployed These results are stored in a 119875 times119872matrix where119875 is the total number of protein pairs in a dataset Individualmethod comparison scores may or may not be normalizeddepending on the PSC method used therefore we normalizeall scores in the matrix between 0 and 1 along each column(feature scaling [45]) that is along each PSM method usedThe normalization process proceeds as follows First themin-imum and maximum values for each column of the matrixcontaining the dissimilarity values are determined Thenthe matrix is scaled by dividing the difference of columnelements and the column minimum value by the differenceof the columnmaximum andminimum value as shown in (1)(where 119894 is the row index and 119895 is the column index) Notethat the minimum value of any column cannot be less thanzero because these are PSC scores with a minimum value ofzero Once the columns have been normalized a consensusMCPSC score is calculated for each protein pair by taking theaverage of the 119872 values corresponding to that protein pairIt must be noted that the smaller the pairwise PSC score thehigher the similarity between the pair proteins

1198831015840

119894119895 =

119883119894119895 minusmin (119883119895)

max (119883119895) minusmin (119883119895) (1)

26 Datasets Used Several datasets listed in Table 2 wereused in this work The table includes statistics about thelength of the protein domains and the distribution of theSCOP [46] classifications in each dataset It can be seen thatthere is significant variation in the sizes of the datasets thelengths of protein domains they contain and the distributionof the SCOP classifications For each dataset we retaineddomains where the PDB file could be downloaded and aclassification could be found for the domain in SCOP v175Further the scripts available for download with the USMsources were used to extract the contact maps from the PDBfiles The Chew-Kedem (CK34) and the Rost-Sander (RS119)datasets were used to develop the load balancing schemesfor the SCC and to study the speedup (throughput) on thei7 processor Further all the datasets were used to comparethe performance of the SCC with that of the i7 on all-to-allMCPCS processing

27 F-Measure Analysis for the Chew-Kedem Dataset Weobtained all-against-all comparison scores for all comparisonmethods and based on these scores we performed pairwisehierarchical clustering for each method using hclust fromthe stats package in 119877 [47] We calculated the 119865-measureof the hierarchical clustering for each PSC methods (TMa-lign USM CE and MCPSC) The 119865-measure is calculated

BioMed Research International 5

Table 2 Basic statistics of the domains in the datasets used in this work The table includes the number of SCOP families superfamilies(SpFams) and folds as well as the total number of domains (No) the Minimum (Min) Maximum (Max) Median and Mean and StandardDeviation (Std) of the domain lengths in each dataset

Dataset No Domain lengths SCOP classificationsMin Max Median Mean Std Families SpFams Folds

Skolnick [57] 33 97 255 158 1677 627 5 5 5Chew-Kedem [58] 34 90 497 147 1795 1001 10 9 9Fischer [59] 68 62 581 181 2206 1256 56 44 40Rost-Sander [60] 114 21 753 167 1932 1234 93 85 71Lancia [61] 269 64 72 68 679 24 79 72 57Proteus [62] 277 64 728 239 2476 1167 53 47 41

Table 3 Time required for the baseline all-to-all PSC task using the TMalign CE and USM PSC methods on the PC and a single core of theSCC The table also shows the time required for the all-to-all MCPSC task (where all three PSC methods are used) All times are in seconds

MethodPorted software on PC Ported software on SCC 1 core

CK34 RS119 CK34 RS119Load Processing Load Processing Load Processing Load Processing

TMalign 001 127 005 1725 030 2514 1 33452CE 080 374 3 6459 14 7152 50 132205USM 001 060 004 7 070 30 2 345All 080 502 3 8191 15 9697 53 166000

using (2) where 119862 is the number of clusters (119862 = 5 in thiswork) as described in [48]

True Positives (TP)

=

119862

sum

119894=1

number of correctly assigned domains

False Positives (FP)

=

119862

sum

119894=1

number of incorrectly assigned domains

False Negatives (FN)

=

119862

sum

119894=1

number of missing domains

Precision (P) = TPTP + FP

Recall (R) = TPTP + FN

F-measure (F) = 2 lowast 119875 lowast 119877119875 + 119877

(2)

3 Results and Discussion

31 Baseline All-to-All PSC Table 3 provides the times takento perform all-to-all comparison on the PC and on a singlecore of the SCC with the CK34 and RS119 datasets Thesoftware took longer to load data and process jobs on a singlecore of the SCC as compared to the PC The longer data

loading times are due to the need for network IO since thedata is stored on the Management Console PC (MCPC) andaccessed by the SCC via NFS In long running services how-ever data is loaded once and used when needed thereforethis time is disregarded in our performance comparisonsThedifferences in processing times are due to the architecturedifference of the processor cores (times86-64 versus times86) andtheir operating frequencies (3GHz versus 533MHz) on thePC and a single SCC core respectively

We performed a statistical analysis using SecStAnT [49]of the proteins participating in the 10 slowest and the 10 fastestpairwise PSC tasks (using TMalign and CE) using the CK34and RS119 datasetsThe parameters on which SecStAnT com-pares proteins are their secondary structure characteristicsBond Angle Dihedral Angle Theta and Psi Results of the 1-parameter and 2-parameter statistical analysis distributionsand correlations did not show any significant statistical differ-ence between the slowest and fastest PSC domains Howeverthe slowest pairs contain the Hlx310Alpha supersecondarystructure which is absent from the fastest pairs The 310-helixoccurs rarely in protein domains because the tight packingof the backbone atoms makes it energetically unstable Thisstructure might be responsible for the increased alignmenttimes in certain domains Further we compared the proteinsusing their UniPort entries and observed that most proteinsin the slowest set are Cytoplasmic while most members inthe fastest set are membrane proteins

32 Relation of Expected PSC Times to the Lengths of theProteins underComparison Comparison of the pairwise PSCprocessing with respect to the protein lengths as shown inFigure 2 reveals some interesting facts the time requiredfor completing a pairwise PSC task is clearly a function of

6 BioMed Research International

Table 4 Sum of squares of residuals after curve fitting for the different PSC methods The table shows that the sum of residuals is lower withquadratic fit (square) as compared to linear fit All values are multiplied with 1119890 + 06

PSC methodResiduals (sum of lengths) Residuals (product of lengths)

CK34 RS119 CK34 RS119Linear Square Linear Square Linear Square Linear Square

TMalign 4 1 100 90 1 1 40 30CE 200 100 4000 1000 100 50 2000 800

0123456789

01 02 03 04 05 06 07 08 09 1

Tim

e (se

c)

0123456789

Tim

e (se

c)

Sum of lengths of pair proteins

01 02 03 04 05 06 07 08 09 1

CK34 dataset

CETMalign

CETMalign

CETMalign

CK34 dataset

0

5

10

15

20

25

0 01 02 03 04 05 06 07 08 09 1Ti

me (

sec)

Tim

e (se

c)Sum of lengths of pair proteins

RS119 dataset

CETMalign

00 01 02 03 04 05 06 07 08 09 1Product of lengths of pair proteinsProduct of lengths of pair proteins

RS119 dataset25

20

15

10

5

0

Figure 2 Comparison of pairwise PSCprocessing times usingTMalign andCEwith respect to the normalized sumof lengths andnormalizedproduct of lengths of pair proteins The dotted lines show the quadratic best fit for the data

the lengths of the proteins forming the pair USM requiresextremely small pairwise comparison times as shown inTable 1 and was thus excluded from this analysis The bestfit is achieved when using the product of lengths of the pairproteins as an attribute and is better than the fit achievedwhen using the sum of the lengths Moreover the quadraticcurve fitting results are better than the linear as shown inTable 4 For a given PSC method the relative times requiredfor completing a pairwise PSC task depend on the proper-ties of the pairs participating However the absolute timerequired for completing a pairwise PSC task depends also onthe complexity of the method used Since the complexity ofone pairwise PSC method as compared to another cannot be

known a priori it cannot be used as a factor for a generic jobspartitioning scheme for load balancing purposes

33 Speedup and Efficiency Analysis of Multiple Criteria PSCon the SCC NoC CPU In this work we experimented withboth round-robin and job partitioning schemes for loadbalancing Given the close correlation between lengths of pairproteins and processing timeswe used this as criteria for bothRound-robin [44] is a dynamic load balancing method inwhich the master process maintains a list of jobs and handsthe next job in the list to the next free slave process It isworth noting that if the list is presorted based on attributesthat are known to be proportional to the processing times this

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

BioMed Research International 5

Table 2 Basic statistics of the domains in the datasets used in this work The table includes the number of SCOP families superfamilies(SpFams) and folds as well as the total number of domains (No) the Minimum (Min) Maximum (Max) Median and Mean and StandardDeviation (Std) of the domain lengths in each dataset

Dataset No Domain lengths SCOP classificationsMin Max Median Mean Std Families SpFams Folds

Skolnick [57] 33 97 255 158 1677 627 5 5 5Chew-Kedem [58] 34 90 497 147 1795 1001 10 9 9Fischer [59] 68 62 581 181 2206 1256 56 44 40Rost-Sander [60] 114 21 753 167 1932 1234 93 85 71Lancia [61] 269 64 72 68 679 24 79 72 57Proteus [62] 277 64 728 239 2476 1167 53 47 41

Table 3 Time required for the baseline all-to-all PSC task using the TMalign CE and USM PSC methods on the PC and a single core of theSCC The table also shows the time required for the all-to-all MCPSC task (where all three PSC methods are used) All times are in seconds

MethodPorted software on PC Ported software on SCC 1 core

CK34 RS119 CK34 RS119Load Processing Load Processing Load Processing Load Processing

TMalign 001 127 005 1725 030 2514 1 33452CE 080 374 3 6459 14 7152 50 132205USM 001 060 004 7 070 30 2 345All 080 502 3 8191 15 9697 53 166000

using (2) where 119862 is the number of clusters (119862 = 5 in thiswork) as described in [48]

True Positives (TP)

=

119862

sum

119894=1

number of correctly assigned domains

False Positives (FP)

=

119862

sum

119894=1

number of incorrectly assigned domains

False Negatives (FN)

=

119862

sum

119894=1

number of missing domains

Precision (P) = TPTP + FP

Recall (R) = TPTP + FN

F-measure (F) = 2 lowast 119875 lowast 119877119875 + 119877

(2)

3 Results and Discussion

31 Baseline All-to-All PSC Table 3 provides the times takento perform all-to-all comparison on the PC and on a singlecore of the SCC with the CK34 and RS119 datasets Thesoftware took longer to load data and process jobs on a singlecore of the SCC as compared to the PC The longer data

loading times are due to the need for network IO since thedata is stored on the Management Console PC (MCPC) andaccessed by the SCC via NFS In long running services how-ever data is loaded once and used when needed thereforethis time is disregarded in our performance comparisonsThedifferences in processing times are due to the architecturedifference of the processor cores (times86-64 versus times86) andtheir operating frequencies (3GHz versus 533MHz) on thePC and a single SCC core respectively

We performed a statistical analysis using SecStAnT [49]of the proteins participating in the 10 slowest and the 10 fastestpairwise PSC tasks (using TMalign and CE) using the CK34and RS119 datasetsThe parameters on which SecStAnT com-pares proteins are their secondary structure characteristicsBond Angle Dihedral Angle Theta and Psi Results of the 1-parameter and 2-parameter statistical analysis distributionsand correlations did not show any significant statistical differ-ence between the slowest and fastest PSC domains Howeverthe slowest pairs contain the Hlx310Alpha supersecondarystructure which is absent from the fastest pairs The 310-helixoccurs rarely in protein domains because the tight packingof the backbone atoms makes it energetically unstable Thisstructure might be responsible for the increased alignmenttimes in certain domains Further we compared the proteinsusing their UniPort entries and observed that most proteinsin the slowest set are Cytoplasmic while most members inthe fastest set are membrane proteins

32 Relation of Expected PSC Times to the Lengths of theProteins underComparison Comparison of the pairwise PSCprocessing with respect to the protein lengths as shown inFigure 2 reveals some interesting facts the time requiredfor completing a pairwise PSC task is clearly a function of

6 BioMed Research International

Table 4 Sum of squares of residuals after curve fitting for the different PSC methods The table shows that the sum of residuals is lower withquadratic fit (square) as compared to linear fit All values are multiplied with 1119890 + 06

PSC methodResiduals (sum of lengths) Residuals (product of lengths)

CK34 RS119 CK34 RS119Linear Square Linear Square Linear Square Linear Square

TMalign 4 1 100 90 1 1 40 30CE 200 100 4000 1000 100 50 2000 800

0123456789

01 02 03 04 05 06 07 08 09 1

Tim

e (se

c)

0123456789

Tim

e (se

c)

Sum of lengths of pair proteins

01 02 03 04 05 06 07 08 09 1

CK34 dataset

CETMalign

CETMalign

CETMalign

CK34 dataset

0

5

10

15

20

25

0 01 02 03 04 05 06 07 08 09 1Ti

me (

sec)

Tim

e (se

c)Sum of lengths of pair proteins

RS119 dataset

CETMalign

00 01 02 03 04 05 06 07 08 09 1Product of lengths of pair proteinsProduct of lengths of pair proteins

RS119 dataset25

20

15

10

5

0

Figure 2 Comparison of pairwise PSCprocessing times usingTMalign andCEwith respect to the normalized sumof lengths andnormalizedproduct of lengths of pair proteins The dotted lines show the quadratic best fit for the data

the lengths of the proteins forming the pair USM requiresextremely small pairwise comparison times as shown inTable 1 and was thus excluded from this analysis The bestfit is achieved when using the product of lengths of the pairproteins as an attribute and is better than the fit achievedwhen using the sum of the lengths Moreover the quadraticcurve fitting results are better than the linear as shown inTable 4 For a given PSC method the relative times requiredfor completing a pairwise PSC task depend on the proper-ties of the pairs participating However the absolute timerequired for completing a pairwise PSC task depends also onthe complexity of the method used Since the complexity ofone pairwise PSC method as compared to another cannot be

known a priori it cannot be used as a factor for a generic jobspartitioning scheme for load balancing purposes

33 Speedup and Efficiency Analysis of Multiple Criteria PSCon the SCC NoC CPU In this work we experimented withboth round-robin and job partitioning schemes for loadbalancing Given the close correlation between lengths of pairproteins and processing timeswe used this as criteria for bothRound-robin [44] is a dynamic load balancing method inwhich the master process maintains a list of jobs and handsthe next job in the list to the next free slave process It isworth noting that if the list is presorted based on attributesthat are known to be proportional to the processing times this

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

6 BioMed Research International

Table 4 Sum of squares of residuals after curve fitting for the different PSC methods The table shows that the sum of residuals is lower withquadratic fit (square) as compared to linear fit All values are multiplied with 1119890 + 06

PSC methodResiduals (sum of lengths) Residuals (product of lengths)

CK34 RS119 CK34 RS119Linear Square Linear Square Linear Square Linear Square

TMalign 4 1 100 90 1 1 40 30CE 200 100 4000 1000 100 50 2000 800

0123456789

01 02 03 04 05 06 07 08 09 1

Tim

e (se

c)

0123456789

Tim

e (se

c)

Sum of lengths of pair proteins

01 02 03 04 05 06 07 08 09 1

CK34 dataset

CETMalign

CETMalign

CETMalign

CK34 dataset

0

5

10

15

20

25

0 01 02 03 04 05 06 07 08 09 1Ti

me (

sec)

Tim

e (se

c)Sum of lengths of pair proteins

RS119 dataset

CETMalign

00 01 02 03 04 05 06 07 08 09 1Product of lengths of pair proteinsProduct of lengths of pair proteins

RS119 dataset25

20

15

10

5

0

Figure 2 Comparison of pairwise PSCprocessing times usingTMalign andCEwith respect to the normalized sumof lengths andnormalizedproduct of lengths of pair proteins The dotted lines show the quadratic best fit for the data

the lengths of the proteins forming the pair USM requiresextremely small pairwise comparison times as shown inTable 1 and was thus excluded from this analysis The bestfit is achieved when using the product of lengths of the pairproteins as an attribute and is better than the fit achievedwhen using the sum of the lengths Moreover the quadraticcurve fitting results are better than the linear as shown inTable 4 For a given PSC method the relative times requiredfor completing a pairwise PSC task depend on the proper-ties of the pairs participating However the absolute timerequired for completing a pairwise PSC task depends also onthe complexity of the method used Since the complexity ofone pairwise PSC method as compared to another cannot be

known a priori it cannot be used as a factor for a generic jobspartitioning scheme for load balancing purposes

33 Speedup and Efficiency Analysis of Multiple Criteria PSCon the SCC NoC CPU In this work we experimented withboth round-robin and job partitioning schemes for loadbalancing Given the close correlation between lengths of pairproteins and processing timeswe used this as criteria for bothRound-robin [44] is a dynamic load balancing method inwhich the master process maintains a list of jobs and handsthe next job in the list to the next free slave process It isworth noting that if the list is presorted based on attributesthat are known to be proportional to the processing times this

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

BioMed Research International 7

Table 5 Comparison of speedup and efficiency achieved by the load balancing schemes in processing the all-to-all MCPSC task vis-a-vie thesingle core (SCC) processing times The efficiency is calculated assuming 47 processing elements (one PE serves as master) All times are inseconds

Load balancing scheme Dataset CK34 Dataset RS119Time Speedup Efficiency Time Speedup Efficiency

Serial (1 SCC core) 9698 mdash mdash 166000 mdash mdashRandom 479 20 043 4765 35 074Greedy partitioning (sum) 409 24 050 4300 39 082Greedy partitioning (product) 396 28 059 4433 38 080Round-robin (sum sorted) 239 41 086 3930 42 090Round-robin (product sorted) 238 41 086 3930 42 090

would naturally result in little to no idle times for the slaveprocesses For static job partitioning determining job-to-coremapping is equivalent to constructing119873 equal partitionsfor a list of elements which is an NP-hard problem [50]In our case 119873 is the number of cores and is equal to 47when all SCC cores are used We investigated methods forPSC jobs partitioning based on the sum and the product ofthe lengths of the two proteins to be compared We havestatically partitioned the PSC jobs using as attribute the sumof lengths (product of lengths) of the paired proteins to becompared The partitions are generated using the LongestProcessing Time (LPT) algorithm the jobs are sorted indescending order by sum of lengths (product of lengths)and then assigned to the partition with the smallest totalrunning sumThese partitions will be referred to as ldquoGreedyrdquopartitions fromnow on Since the job partitions are processedon different cores this strategy is expected to balance theprocessing times by reducing idle times

Figure 3 shows the partitions generated using a randompartitioning and a greedy partitioning scheme based on thesum of lengths and the product of lengths as attributes Parti-tions created with the random scheme with equal number ofpair proteins per partition vary greatly in terms of the totalnormalized sum of the sum (or product) of the pair proteinlengths The size of this normalized sum is expected to beproportional to the overall processing time each partitionwillrequire when assigned to a core Therefore if each partitionof PSC tasks created with the random scheme is processedon a different core several cores will have significant idletimes at the end On the contrary partitions created with thegreedy scheme are almost equal as to the normalized sum ofthe sum (or product) of the pair protein lengths Hence ifeach partition of PSC tasks created by the greedy scheme isprocessed on a different processing element (PE) idle timesare expected to be minimized

We extended the framework introduced in [36] by incor-porating more PSC methods (CE and USM) in additionto TMalign and computing consensus scores in order toperform Multicriteria PSC We retain the master-slaves pro-cessing model in the extended framework the first availablecore runs the master process and all other cores run slaveprocesses The master loads the data and generates a list ofjobs each one involving a single pairwise Protein StructureComparison operation to be performed using a specifiedPSC method Due to the vastly different processing times

0163248

0163248

0163248

0 2 4 6 8 10 12

Cor

e Id

Random split CK34 dataset

0 05 1 15 2 25 3 35 4

0 05 1 15 2 25 3 35 4

Cor

e Id

Greedy (sum) split CK34 dataset

Cor

e Id

Greedy (prod) split CK34 dataset

Figure 3 The 47 partitions created from list of PSC tasks sortedrandomly or by the sum (product) of lengths of pair proteins Eachhorizontal line represents the sum of the normalized lengths of theprotein pairs assigned to that partition

of PSC operations when using different methods we allowthe master process (a) to make a choice between sequentialversus parallel job processing and (b) implement a loadbalancing scheme for the slavesThe first allows the master tosequentially process jobs of a PSC method when distributingthem which would actually result in an increased processingtime The second allows the master to reduce idle times inslave processes which perform PSC continuously receivingjobs from and returning the results to the master till theyreceive a terminate signal

The consensus value for a protein pair is calculated usingthe scores from the three PSC methods (a) the USM-metricfrom USM (b) the TM-score from TMalign and (c) the RootMean Square Distance (RMSD) from CE The USM-metricand RMSD are distances and have a zero value for identicalstructures The TM-score is therefore inverted since it is theonly one of the three which is a similarity metric At the endthe average of the three normalized dissimilarity scores istaken as the MCPSC score for the protein pair

In Table 5 we compare the performance of the master-slaves setup with one master and 47 slaves to that of a singlecore of the SCC and show the speedup and the efficiencyachieved using the CK34 and RS119 datasets TMalign andCE jobs were run under the master-slaves model while USM

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

8 BioMed Research International

Table 6 Speedup efficiency and throughput in pairwise-PSC tasks per second for performing MCPSC on an Intel Core i7 multicore CPUusing the CK34 and RS119 datasets

Threads Dataset CK34 Dataset RS119Speedup Efficiency Throughput Speedup Efficiency Throughput

1 100 100 492 100 100 3492 188 094 927 146 073 5093 204 068 1001 198 066 6904 254 063 1247 260 065 9085 270 054 1325 270 054 9426 271 045 1332 271 045 9457 274 039 1347 269 038 9388 264 033 1298 267 033 931

jobs were processed by the master due to the negligible timerequired by these jobs

The greedy static partitioning scheme improves on theperformance of the random partitioning scheme but it isinadequate for the SCC where the cost of data transferbetween cores is low This scheme would be of interestin clusters of workstations where interconnection networksare slow The greedy partitioning scheme could also bebeneficial on larger many-core processors where the mastermay become a bottleneck making off-line creation of batchesof PSC tasks essential In such a scenario however othertechniques such as work-stealing would need to be assessedbefore selecting the most appropriate method

The round-robin dynamic job assignment strategy out-performs the best offline partitioning scheme Furthermorethe sorted round-robin flawlessly balances out the cores pro-cessing times on both datasets making it the best approachfor distributing jobs to the cores of the SCC In this approachthe protein pairs are presorted in descending order of product(or sum) of pair protein lengths Specifically for the one-to-many PSC case we simply retrieve the database proteins indescending order of length and create pairwise PSC taskswiththe query protein For the many-to-many PSC case howeverwe need to determine the correct order of the proteins whenthe query proteins are received

The sorted round-robin strategy distributes the jobs mostefficiently for the proposed master-slaves setup on the IntelSCC with 47 slaves This setup achieves a 42-fold (40-fold)speedup for the RS119 (CK34) dataset as compared to a singlecore of the SCCThe speedup is almost linear which suggeststhat high efficiency can be achieved evenwhen themany-coreprocessor has more nodes A bigger many-core processorhowever may require using a hierarchy of masters to avoida bottleneck on a single master node In such a scenarioa combination of partitioning schemes and round-robinjob assignment could be used to distribute and minimizeidle times For instance on larger NoC processors a well-balanced solution may require the use of clusters of process-ing elements concurrently processing PSC jobs with differentmethods All PEs computing jobs of a PSC method wouldreceive jobs from a specific submaster All such submasterswould in turn receive jobs from a common global master

34 Comparison of Multicriteria PSC Performance on Multi-core and Many-Core CPUs In order to perform this exper-iment the MCPSC software framework was retargeted tothe Intel Core i7 multicore processor using openmp [51] toimplement multithreading This experiment allowed us toassess how the MCPSC problem scales with increasing num-ber of cores on a modern multicore CPU readily availablefor scientists and engineers This multicore software versionuses shared memory to replace the RCCE based messagepassing (recv and send) calls used in the many-core versiondeveloped for the Intel SCC As shown in Table 6 we observespeedup on this multicore processor when running all-to-allMCPSCwith the CK34 and RS119 datasets up to the 4 threadsconfigurations Thereafter a steep speedup drop (efficiencyloss) is observed

This is attributed to the fact that this is a quad-coreprocessor that implements Hyper Threading (HT) [41] Sinceall threads operate on exactly the same type of instruc-tions workload (SPMD) it cannot take advantage of eachcorersquos super scalar architecture that needs varied workloadplacement (eg integer and floating point arithmetic at thesame time) to show performance advantages when utilizingmore threads than cores Further an SMP Operating System(OS) like the one we run on our test system uses allsystem cores as resources for swapping threads in and outA thread executing on core 1198751198641 for some number of cyclesmay get swapped out and later resume execution on core1198751198642 Thus the thread cache on core 1198751198641 becomes useless andit gets a lot of cache misses when restarting on 1198751198642 ThisOS behavior impacts cache performance and reduces theoverall application performance In terms of pairwise-PSCtasks per second (pps) the Intel Core i7 achieves a through-put of 1347 pps (945 pps) on the CK34 (RS119) dataset ascompared to 553 pps (363 pps) achieved by the SCC Thisis due to the fact that the multicore CPU contains latestgeneration cores featuring a highly superior out-of-ordermicroarchitecture and clocked at almost 75x the frequencyof the SCC We believe that even more performance can beexploited from next generation multicore processors shouldthey start introducing hardware memory structures like theMessage Passing Buffer (MPB) of the SCC NoC processoralongside their cores for increased communication efficiency

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

BioMed Research International 9

Table 7 Comparison of the SCC and i7 CPU in terms of throughputdelivered on all-to-all MCPSC

Dataset Pairs SCCthroughput

i7throughput Ratio

Skolnick 1089 605 2792 462Chew-Kedem 1156 503 1329 264Fischer 4624 198 937 474Rost-Sander 12996 330 938 285Lancia 72361 mdash 22265 mdashProteus 76729 mdash 677 mdash

among them without resorting to shared memory and itsunavoidable locks or cache-coherency protocol overheads

Finally Table 7 shows the comparative performance ofthe SCC (with 48 cores) and the i7 (running 7 threads)on all-to-all MCPSC using several datasets It can be seenthat the i7 outperforms the SCC consistently in terms ofthroughput The two larger datasets Lancia and Proteuswere not processed on the SCC because of the limited per-core memory (512MB) which was not sufficient to load thedomain structure data for the full dataset A decrease inthroughput was observed as the size of the dataset increaseswith some exceptions Both the SCC and the i7 deliver lowerthan expected throughput (based on its size) on the Fischerdataset which we believe is due to the higher complexity ofthe dataset As evidenced by Table 2 the Fischer dataset hasdomains belonging to 5 times more SCOP Super Families ascompared to CK34 and also has the second highest medianlength of protein domains among all the datasets used in thiswork Conversely the throughput delivered by the i7 on theLancia dataset is significantly higher than that delivered forthe similar size (in terms of number of domains) Proteusdataset We attribute this difference to the large differencebetween themean lengths of domains in each dataset as notedin Table 2

35 Qualitative Analysis of MCPSC Results Based on theChew-Kedem Dataset The acceleration capabilities offeredby modern processors (many-core and multicore) enablesperforming all-to-all PSC using different methods on large-scale datasets and compare results For example it becomeseasy to perform a qualitative analysis on a set of proteindomains and use consensus score (GPrsquoS) to explore relationbetween the protein domains categorize them into bio-logically relevant clusters [48] and automate generation ofdatabases such as Structural Classification of Proteins (SCOP)[46]

The MCPSC comparison method outperforms the com-ponent PSCmethods (TMalign CE andUSM) and groups 34representative proteins fromfivefold families into biologicallysignificant clusters In Figure 4 each protein domain is usingthe format ldquodomainName foldFamiliesrdquo The tags of theldquofoldFamiliesrdquo belong to one of (i) tb TIM barrel (ii) gglobins (iii) ab alpha beta (iv) b all beta and (v) a all alphaprotein families The clusters obtained from the MCPSCmethod show that most domains are grouped correctly

according to their structural fold In comparison all thecomponent PSCmethods produce clusters in which there aremany wrongly grouped domains Details of domain clusterallocation of all four methods are included in the Supple-mentary Materials file (see Supplementary Material availableonline at httpdxdoiorg1011552015563674) dom clustxlswith clusters generated by USM containing the highestnumber of errors The quantitative analysis evaluating theRecall versus Precision tradeoff based on the 119865-measure [52]results in a value of 119865 = 091 for the MCPSC method whichis higher than TMalign (119865 = 082) CE (119865 = 071) andUSM (119865 = 062) Finally in the Supplementary Materialfile supplementary materialpdf weprovide clusters generatedby hclust [47] for the four PSC methods (see Figures S1S2 and S3) PSC scores generated for all the datasets for allPSCmethods are included in the SupplementaryMaterial filepsc scoreszip

4 Conclusions and Future Work

In the near future we will see a continued increase inthe amount of cores integrated on a single chip as well asincreased availability of commodity many-core processorsThis trend is already visible today in that off-the-shelf PCscontain processors with up to 8 cores server grade machinesoften contain multiple processors with up to 32 cores andmulticore processors are even appearing in average grademobile devices Further the success of GPUs Tilerarsquos archi-tecture and Intelrsquos initiative for integrating more and morecores on a single chip also attest to this trend The ubiquityand the advanced core architectures employed by multicoreCPUsmake it imperative to build software that can efficientlyutilize their processing power Our experiments show that amodern Core i7 is able to deliver high throughput for all-to-all MCPSC Utilizing the performance gain delivered bymul-ticoreCPUbecomes especially importantwhen considered incombination with distributed setups such as clusters whichmay already contain nodes with multiple cores

Both multicore and many-core architectures lend them-selves to the popular and powerful MapReduce parallelprogramming model [53] If the underlying communicationlibrariesmdashtypically built on top of Message Passing Interface(MPI) [54]mdashare available this highly popular approachcan be applied to large-scale MCPSC in a manner similarto its application in other domains of bioinformatics [55]While some implementation of MapReduce is available formulticore machines it must be noted that no MapReduceframework implementations are available for the SCC Onepossible approach for using MapReduce in MCPSC wouldbe to (a) determine pairwise PSC to be performed on eachPE (pre-Map) (b) run all PSC methods on all pairs on eachPE in parallel (Map) (c) collect PSC scores from all PEsto a designated PE (shuffle) and (d) combine and generatefinal MCPSC scores (Reduce) Movement of data for taskdistribution (pre-Map) intermediate data transfer performedby the ldquoshufflerdquo mechanism and collection of results ifrequired by the Reduce step would benefit from low cost ofinter-PE communication Load balancing methods similar to

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

10 BioMed Research International

6xi

a_tb

1jh

g_a

1m

ba_g

1flp

_g1

ith_g

5m

bn_g

1m

yt_g

1ec

a_g

2hb

g_g

1ba

b_g

1as

h_g

3sd

h_g

2lh

b_g

1hl

b_g

1hl

m_g

2vh

b_g

1lh2

_g2

mnr

_tb

1ch

r_tb

4en

l_tb

5p2

1_a

b1

qra_

ab1

gnp_

ab6

q21

_ab

1aa9

_ab

1qf

o_b

1ne

u_b

1cd8

_b1

qa9

_b1

hnf_

b1

cdb_

b1

ci5

_b1

ct9

_ab

1cn

p_a

00

01

02

03

04

05

06

MCPSC

Figure 4 Hierarchical clustering result using the Chew-Kedem dataset and MCPSC score as distance metric between domains Each boxrepresents a cluster and the domains belonging to it The average linkage method was used to build the dendrogram

those discussed in this work could be useful in determiningeffective strategies for the distribution of PSC tasks (pre-Map) Analysis of MCPSC performance on multicore andmany-core processors withMapReduce would be required tounderstand the design trade-offs that would lead to efficientdesigns

Scalability limitation of the bus-based architecture cou-pled with the greater potential of Network-on-Chip basedprocessors to deliver efficiency and high performance impliesthat NoCs are likely to be used extensively in future many-core processors It is therefore important to start developingsoftware frameworks and solutions that capitalize on thisparallel architecture to meet the increasing computationaldemands in structural proteomics and bioinformatics ingeneral Comparison of many-core and multicore processorsin this work shows that the former can deliver higherefficiency than the latter processor architectures Furtherour experimental results make it clear that Intelrsquos SCC NoCmatches the speedup and efficiency achieved by a clusterof faster workstations [14] However a market ready NoCCPU will have a much better per watt performance ascompared to a cluster while also saving in space and infras-tructuremanagement costs Additionally any savings in wattsconsumed also reflect in savings in cooling infrastructurerequired for the hardware Furthermore with the per wattperformance being in focus for processor manufacturerssuch as Intel and Tilera this gap is set to expand evenfurther Finally many-core processors when used as the

CPUs of near-future clusters of workstations will providean additional level of local parallelism to applications whereperformance scalability is essential to tackle ever increasingcomputational demands As this study demonstrates thelikely increase in the availability and ubiquitousness of many-core processors the near linear speedup in tackling theMCPSC scenario and the ease of porting new PSC methodsto NoC based processors make many-core processors be ofgreat interest for the high performance structural proteomicsand bioinformatics communities in general

Incorporating additional PSC methods to a many-coreNoC processor is straightforward due to the familiar pro-gramming model and architecture of the individual coresOur rckskel library [36] offers convenience and parallel pro-gramming functions to further accelerate development on theNoC On future larger NoC processors optimal solutionsmayrequire the use of clusters of cores concurrently processingPSC jobs with different methods Using the rckskel librarywould facilitate the software development for such morecomplicated scenarios hiding low level details from the usersThe advantages offered by an algorithmic skeletons librarylike the rckskel make it of interest to further develop thelibrary To this end we intend to introduce communicationcontrols that will allow true blocking implementations ofldquosendrdquo and ldquorecvrdquo regardless of the communication backendlibrary used (RCCE eg provides busy-wait loops for block-ing) Such an implementation would increase the energy effi-ciency of applications built using the library Further work on

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

BioMed Research International 11

the librarywould involve separating the algorithmic skeletonscode from the communication subsystem so that it can beretargeted to other processors This would essentially requiresand-boxing the RCCE calls behindwell-structuredAPIs thatcan allow a different communication library to be plugged in

Further we would like to investigate the performancecharacteristics of all-to-all MCPSC on multicore and many-core processors with very large datasets Experimenting withsuch large datasets like the one used in [56] with more than6000 domains is likely to provide higher granularity andassist in more accurate assessment of the impact of differentload balancing schemes in terms of speedup delivered Addi-tionally cluster analysis on the results of such a large datasetwill provide a better measure of the biological relevance ofclusters generated by MCPSC A dataset of that size wouldgenerate more than 32 million pairwise comparison tasksper PSC method and keeping the number of methods to3 (as is the case in this work) would result in nearly 10million tasks Due to these large numbers processing sucha dataset specifically on a many-core processor would befeasible only with a larger NoC that is one which supportsmore memory and uses cores of newer architectures thanthe SCC Showing that such a large computation can becompleted in reasonable time would however further justifythe drive towards leveraging parallel processing capabilitiesof multicore and many-core processors

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Authorsrsquo Contribution

Anuj Sharma contributed to the design of the study devel-oped the software framework and performed the experi-ments Elias S Manolakos conceived the study and con-tributed to the design and evaluation All authors contributedto the discussion of results andwrote and approved the paper

Acknowledgments

Mr Anuj Sharma and Dr Elias S Manolakos would like toacknowledge the support of this research by the EuropeanUnion (European Social FundESF) andGreek national fundsthrough the Operational Program ldquoEducation and LifelongLearningrdquo of the National Strategic Reference Framework(NSRF) Research Funding Program ldquoHeracleitus IIrdquo Theauthors would also like to acknowledge the MicroLab ECENational Technical University of Athens Greece for allowinguse of their SCC infrastructure In addition we thank MrDimitrios Rodopoulos and Professor Dimitrios Soudris ofNTUA for providing comments and assistance

References

[1] R A Laskowski J D Watson and J M Thornton ldquoProFunca server for predicting protein function from 3D structurerdquoNucleic Acids Research vol 33 no 2 pp 89ndash93 2005

[2] B Pang N Zhao M Becchi D Korkin and C-R ShyuldquoAccelerating large-scale protein structure alignments withgraphics processing unitsrdquo BMC Research Notes vol 5 no 1article 116 2012

[3] E Bielska X Lucas A Czerwiniec J M Kasprzak K HKaminska and J M Bujnicki ldquoVirtual screening strategies indrug design methods and applicationsrdquo Journal of Biotechnol-ogy Computational Biology and Bionanotechnology vol 92 no3 pp 249ndash264 2011

[4] H Kubinyi ldquoStructure-based design of enzyme inhibitors andreceptor ligandsrdquo Current Opinion in Drug Discovery andDevelopment vol 1 no 1 pp 4ndash15 1998

[5] G Lancia and S Istrail ldquoProtein structure comparison algo-rithms and applicationsrdquo in Mathematical Methods for ProteinStructure Analysis and Design vol 2666 of Lecture Notes inComputer Science pp 1ndash33 Springer Berlin Germany 2003

[6] I Eidhammer I Jonassen andW R Taylor ldquoStructure compar-ison and structure patternsrdquo Journal of Computational Biologyvol 7 no 5 pp 685ndash716 2000

[7] Y B Ruiz-Blanco W Paz J Green and Y Marrero-PonceldquoProtDCal a program to compute general-purpose-numericaldescriptors for sequences and 3D-structures of proteinsrdquo BMCBioinformatics vol 16 no 1 article 162 2015

[8] D Barthel J D Hirst J Błazewicz E K Burke and NKrasnogor ldquoProCKSI a decision support system for protein(structure) comparison knowledge similarity and informa-tionrdquo BMC Bioinformatics vol 8 article 416 2007

[9] M Arriagada and A Poleksic ldquoOn the difference in qualitybetween current heuristic and optimal solutions to the proteinstructure alignment problemrdquo BioMed Research Internationalvol 2013 Article ID 459248 8 pages 2013

[10] D E Robillard P T Mpangase S Hazelhurst and F DehneldquoSpeeDB fast structural protein searchesrdquo Bioinformatics vol31 no 18 pp 3027ndash3034 2015

[11] M Veeramalai D Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 no 1 article 138 2010

[12] S B Pandit and J Skolnick ldquoFr-TM-align a new proteinstructural alignment method based on fragment alignmentsand the TM-scorerdquo BMC Bioinformatics vol 9 no 1 article 5312008

[13] L Mavridis and D Ritchie ldquo3D-blast 3D protein structurealignment comparison and classification using spherical polarFourier correlationsrdquo in Proceedings of the Pacific Symposium onBiocomputing pp 281ndash292 World Scientific PublishingKamuela Hawaii USA January 2010 httphalinriafrinria-00434263en

[14] A A Shah G Folino and N Krasnogor ldquoToward high-throughput multicriteria protein-structure comparison andanalysisrdquo IEEE Transactions on Nanobioscience vol 9 no 2 pp144ndash155 2010

[15] L I Kuncheva and C J Whitaker ldquoMeasures of diversity inclassifier ensembles and their relationship with the ensembleaccuracyrdquo Machine Learning vol 51 no 2 pp 181ndash207 2003

[16] P Sollich and A Krogh ldquoLearning with ensembles howoverfitting can be usefulrdquo in Proceedings of the Advances inNeural Information Processing Systems (NIPS rsquo96) vol 8 pp190ndash196 1996

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 12: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

12 BioMed Research International

[17] G Brown J Wyatt R Harris and X Yao ldquoDiversity creationmethods a survey and categorisationrdquo Information Fusion vol6 no 1 pp 5ndash20 2005

[18] P Chi Efficient protein tertiary structure retrievals and clas-sifications using content based comparison algorithms [PhDdissertation] University of Missouri at Columbia ColumbiaMo USA 2007

[19] A Poleksic ldquoAlgorithms for optimal protein structure align-mentrdquo Bioinformatics vol 25 no 21 pp 2751ndash2756 2009

[20] A A Shah Studies on distributed approaches for large scalemulti-criteria protein structure comparison and analysis [PhDthesis] University of Nottingham Nottingham UK 2011httpeprintsnottinghamacuk11735

[21] M Pharr and R Fernando GPU Gems 2 Programming Tech-niques for High-Performance Graphics and General-PurposeComputation Pearson Eduction Addison-Wesley Professional2005

[22] M Azimi N Cherukuri D Jayashima et al ldquoIntegrationchallenges and tradeoffs for tera-scale architecturesrdquo Intel Tech-nology Journal vol 11 pp 173ndash184 2007

[23] S Sarkar T Majumder A Kalyanaraman and P P PandeldquoHardware accelerators for biocomputing a surveyrdquo in Pro-ceedings of the IEEE International Symposium on Circuits andSystems (ISCAS rsquo10) pp 3789ndash3792 IEEE Paris France June2010

[24] E Kouskoumvekakis D Soudris and E S Manolakos ldquoMany-core CPUs can deliver scalable performance to stochasticsimulations of large-scale biochemical reaction networksrdquo inProceedings of the International Conference onHigh PerformanceComputing amp Simulation (HPCS rsquo15) pp 517ndash524 IEEE Ams-terdam The Netherlands July 2015

[25] A A Shah D Barthel and N Krasnogor ldquoGrid and distributedpublic computing schemes for structural proteomics a shortoverviewrdquo in Frontiers of High Performance Computing andNetworking ISPA 2007 Workshops vol 4743 pp 424ndash434Springer Berlin Germany 2007

[26] E Totoni B Behzad S Ghike and J Torrellas ldquoComparing thepower and performance of Intelrsquos SCC to state-of-the-art CPUsand GPUsrdquo in Proceedings of the IEEE International Symposiumon Performance Analysis of Systems amp Software (ISPASS rsquo12) pp78ndash87 IEEE Computer Society New Brunswick NJ USA April2012

[27] T Bjerregaard and S Mahadevan ldquoA survey of research andpractices of network-on-chiprdquoACMComputing Surveys vol 38no 1 article 1 Article ID 1132953 2006

[28] B Marker E Chan J Poulson et al ldquoProgramming many-corearchitecturesmdasha case study dense matrix computations on theIntel single-chip cloud computer processorrdquo Concurrency andComputation Practice and Experience 2011

[29] G Blake R G Dreslinski and TMudge ldquoA survey of multicoreprocessorsrdquo IEEE Signal Processing Magazine vol 26 no 6 pp26ndash37 2009

[30] P Guerrier and A Greiner ldquoA generic architecture for on-chip packet-switched interconnectionsrdquo in Proceedings of theConference on Design Automation and Test in Europe (DATErsquo00) pp 250ndash256 ACM Paris France 2000

[31] D Atienza F Angiolini S Murali A Pullini L Benini and GDe Micheli ldquoNetwork-on-chip design and synthesis outlookrdquoIntegration vol 41 no 2 pp 1ndash35 2008

[32] R P Mohanty A K Turuk and B Sahoo ldquoPerformanceevaluation of multi-core processors with varied interconnectnetworksrdquo in Proceedings of the 2nd International Conference onAdvanced Computing Networking and Security (ADCONS rsquo13)pp 7ndash11 IEEE Computer Society Mangalore India December2013

[33] S Isaza Multicore architectures for bioinformatics applications[PhD thesis] University of Lugano Lugano Switzerland 2011

[34] S B Needleman and C D Wunsch ldquoA general method appli-cable to the search for similarities in the amino acid sequenceof two proteinsrdquo Journal of Molecular Biology vol 48 no 3 pp443ndash453 1970

[35] S Sarkar G R Kulkarni P P Pande and A Kalyanara-man ldquoNetwork-on-chip hardware accelerators for biologicalsequence alignmentrdquo IEEE Transactions on Computers vol 59no 1 pp 29ndash41 2010

[36] A Sharma A Papanikolaou and E SManolakos ldquoAcceleratingall-to-all protein structures comparison with TMalign usinga NoC many-cores processor architecturerdquo in Proceedings ofthe 27th IEEE International Parallel and Distributed ProcessingSymposium Workshops amp PhD Forum (IPDPSW rsquo13) pp 510ndash519 IEEE Cambridge Mass USA May 2013

[37] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[38] N Krasnogor and D A Pelta ldquoMeasuring the similarity ofprotein structures by means of the universal similarity metricrdquoBioinformatics vol 20 no 7 pp 1015ndash1021 2004

[39] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[40] N Melot K Avdic C Kessler and J Keller ldquoInvestigationof main memory bandwidth on Intel Single-Chip Cloudcomputerrdquo in Proceedings of the 3rd Many-Core ApplicationsResearch Community Symposium (MARC rsquo11) pp 107ndash110Ettlingen Germany July 2011

[41] S Saini H Jin R Hood D Barker P Mehrotra and RBiswas ldquoThe impact of hyper-threading on processor resourceutilization in production applicationsrdquo in Proceedings of the18th International Conference on High Performance Computing(HiPC rsquo11) pp 1ndash10 Bangalore India December 2011

[42] H Gonzalez-Velez and M Leyton ldquoA survey of algorithmicskeleton frameworks high-level structured parallel program-ming enablersrdquo Software Practice and Experience vol 40 no12 pp 1135ndash1160 2010

[43] D K G Campbell ldquoTowards the classification of algorithmicskeletonsrdquo Tech Rep YCS 276 1996

[44] A Silberschatz P B Galvin and G Gagne Operating SystemConcepts John Wiley amp Sons Wiley Publishing 8th edition2008

[45] D M Tax and R P W Duin ldquoFeature scaling in support vectordata descriptionsrdquo in Learning from Imbalanced Datasets NJapkowicz Ed pp 25ndash30 AAAI Press Menlo Park Calif USA2000

[46] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 13: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

BioMed Research International 13

[47] R Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing ViennaAustria 2015 httpwwwr-projectorg

[48] M Veeramalai D R Gilbert and G Valiente ldquoAn optimizedTOPS+ comparisonmethod for enhanced TOPSmodelsrdquo BMCBioinformatics vol 11 article 138 2010

[49] G Maccari G L B Spampinato and V Tozzini ldquoSecStAnTsecondary structure analysis tool for data selection statisticsand models buildingrdquo Bioinformatics vol 30 no 5 pp 668ndash674 2014

[50] S Mertens ldquoThe easiest hard problem number partitioningrdquoin Computational Complexity and Statistical Physics A PercusG Istrate and C Moore Eds pp 125ndash139 Oxford UniversityPress New York NY USA 2006

[51] OpenMP API for parallel programming version 40 httpopenmporgwp

[52] J Handl J D Knowles and D B Kell ldquoComputational clustervalidation in post-genomic data analysisrdquo Bioinformatics vol21 no 15 pp 3201ndash3212 2005

[53] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[54] M P Forum ldquoMPI amessage-passing interface standardrdquo TechRep University of Tennessee Knoxville Tenn USA 1994

[55] Q Zou X-B Li W-R Jiang Z-Y Lin G-L Li and K ChenldquoSurvey of MapReduce frame operation in bioinformaticsrdquoBriefings in Bioinformatics 2013

[56] N Malod-Dognin and N Przulj ldquoGR-Align fast and flexiblealignment of protein 3D structures using graphlet degreesimilarityrdquo Bioinformatics vol 30 no 9 pp 1259ndash1265 2014

[57] W Xie and N V Sahinidis ldquoA reduction-based exact algorithmfor the contact map overlap problemrdquo Journal of ComputationalBiology vol 14 no 5 pp 637ndash654 2007

[58] L P Chew and K Kedem ldquoFinding the consensus shape for aprotein family (extended abstract)rdquo in Proceedings of the 18thACM Annual Symposium on Computational Geometry (SCGrsquo02) pp 64ndash73 ACM Barcelona Spain June 2002

[59] D Fischer A Elofsson D Rice and D Eisenberg ldquoAssessingthe performance of fold recognition methods by means ofa comprehensive benchmarkrdquo in Proceedings of the PacificSymposium on Biocomputing pp 300ndash318 January 1996

[60] B Rost and C Sander ldquoPrediction of protein secondary struc-ture at better than 70 accuracyrdquo Journal of Molecular Biologyvol 232 no 2 pp 584ndash599 1993

[61] A Caprara R D Carr S Istrail G Lancia and B Walenzldquo1001 Optimal PDB structure alignments integer programmingmethods for finding the maximum contact map overlaprdquoJournal of Computational Biology vol 11 no 1 pp 27ndash52 2004

[62] R Andonov N Yanev and N Malod-Dognin ldquoAn efficientlagrangian relaxation for the contact map overlap problemrdquo inAlgorithms in Bioinformatics 8th InternationalWorkshopWABI2008 Karlsruhe Germany September 15ndash19 2008 ProceedingsK A Crandall and J Lagergren Eds vol 5251 of Lecture Notesin Computer Science pp 162ndash173 Springer Berlin Germany2008

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 14: Research Article Efficient Multicriteria Protein Structure ...mark MPBT data. Legacy soware by default does not mark memory as MPBT and runs without modi cations. If the bit is set,

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology