101
About this Journal Advances in Bioinformatics is a peer-reviewed, open access journal that publishes original research articles as well as review articles in all areas of bioinformatics. Aims and Scope Advances in Bioinformatics is a peer-reviewed, open access journal that publishes original research articles as well as review articles in all areas of bioinformatics. Advances in Bioinformatics ISSN: 1687-8027 Volume 2014 No. 1, June 2014

Advances in Bioinformatics-1

Embed Size (px)

DESCRIPTION

Journal of Advances in Bioinformatics.

Citation preview

  • About this Journal

    Advances in Bioinformatics is a peer-reviewed, open access journal that publishes original research articles as well as review articles in all areas of bioinformatics.

    Aims and Scope

    Advances in Bioinformatics is a peer-reviewed, open access journal that publishes original research articles as well as review articles in all areas of bioinformatics.

    Advances in

    Bioinformatics ISSN: 1687-8027 Volume 2014 No. 1, June 2014

  • Abstracting and IndexingThe articles of Advances in Bioinformatics are included in the following databases/resources:

    Academic OneFile

    Academic Search Complete

    Access to Global Online Research in Agriculture (AGORA)

    Airiti Library

    Applied Science and Technology Source

    Biological Sciences

    BioMedSearch

    Biotechnology and BioEngineering Abstracts

    Biotechnology Research Abstracts

    CAB Abstracts

    Chemical Abstracts Service (CAS)

    CNKI Scholar

    Computers and Applied Sciences Complete

    CSA Illustrata - Natural Sciences

    CSA Illustrata - Technology

    CSA Technology Research Database

    Current Abstracts

    Directory of Open Access Journals (DOAJ)

    EBSCO Discovery Service

    EBSCOhost Connection

    Expanded Academic Index

    Google Scholar

    HINARI Access to Research in Health Programme

    InfoTrac Custom journals

    INSPEC

    J-Gate Portal

    Odysci Academic Search

    ProQuest Advanced Technologies and Aerospace Collection

    ProQuest Biological Science Collection

    ProQuest Computer Science Journals

    ProQuest Natural Science Collection

    ProQuest SciTech Collection

    PubMed

    PubMed Central

    Scopus

    The DBLP Computer Science Bibliography

    The Index of Information Systems Journals

    The Informatics Portal io-port.net

    TOC Premier

    Advances in

    Bioinformatics ISSN: 1687-8027 Volume 2014 No. 1, June 2014

  • Editorial Board

    Shandar Ahmad, National Institute of Biomedical Innovation, Japan Tatsuya Akutsu, Kyoto University, Japan Rolf Backofen, University of Freiburg, Germany Craig Benham, University of California, Davis, USA Mark Borodovsky, Georgia Institute of Technology, USA Rita Casadio, Universit di Bologna, Italy Ming Chen, Zhejiang University, China David Corne, Heriot Watt University, United Kingdom Bhaskar Dasgupta, University of Illinois at Chicago, USA Ramana Davuluri, The Wistar Institute, USA J. Dopazo, Felipe Research Centre, Spain Anton Enright, European Bioinformatics Institute, United Kingdom Stavros J. Hamodrakas, National and Capodistrian University of Athens, Greece Paul Harrison, McGill University, USA Huixiao Hong, U.S. Food and Drug Administration, USA David Jones, University College London, United Kingdom George Karypis, University of Minnesota, USA Jian-Liang Li, Sanford-Burnham Medical Research Institute, USA Jie Liang, University of Illinois at Chicago, USA Guohui Lin, University of Alberta, Canada Pietro Li, University of Cambridge, United Kingdom Dennis Livesay, University of North Carolina at Charlotte, USA Satoru Miyano, The University of Tokyo, Japan Burkhard Morgenstern, University of Goettingen, Germany Masha Niv, Hebrew University of Jerusalem, Israel Florencio Pazos, Consejo Superior de Investigaciones Cientficas, Spain David Posada, Universidad de Vigo, Spain Jagath Rajapakse, Nanyang Technological University, Singapore Marcel J. T. Reinders, Delft University of Technology, The Netherlands P. Rouze, Ghent University, Belgium Alejandro A. Schffer, National Institutes of Health, USA E. L. Sonnhammer, Stockholm University, Sweden Sandor Vajda, Boston University, USA Yves Van de Peer, U Gent, Belgium Antoine van Kampen, University of Amsterdam, The Netherlands Alexander Zelikovsky, Georgia State University, USA Zhongming Zhao, Vanderbilt University, USA Yi Ming Zou, University of Wisconsin-Milwaukee, USA

  • Editorial Workflow

    The following is the editorial workflow that every manuscript submitted to the journal undergoes during the course of the peer-review process.

    The entire editorial workflow is performed using the online Manuscript Tracking System. Once a manuscript is submitted it is sent to an appropriate Editor based on the subject of the manuscript and the availability of the Editors. If the Editor finds that the manuscript may not be of sufficient quality to go through the normal peer review process, or that the subject of the manuscript may not be appropriate for the journals scope, the Editor may Refuse to Consider the manuscript. In this case, the manuscript is sent to a second Editor, and if the second Editor also chooses to Refuse to Consider the manuscript, the manuscript shall be rejected with no further processing.

    If the Editor finds that the submitted manuscript is of sufficient quality and falls within the scope of the journal, they would assign the manuscript to a minimum of 2 and a maximum of 5 external reviewers for peer-review. The reviewers submit their reports on the manuscripts along with their recommendation of one of the following actions to the Editor:

    Publish Unaltered Consider after Minor Changes Consider after Major Changes Reject: Manuscript is flawed or not sufficiently novel

    When all reviewers have submitted their reports, the Editor can make one of the following editorial recommendations:

    Publish Unaltered Consider after Minor Changes Consider after Major Changes Reject

    If the Editor recommends Publish Unaltered, the manuscript is accepted for publication.

    If the Editor recommends Consider after Minor Changes, the authors are notified to prepare and submit a final copy of their manuscript with the required minor changes suggested by the reviewers. The Editor reviews the revised manuscript after the minor changes have been made by the authors. Once the Editor is satisfied with the final manuscript, the manuscript can be accepted.

    If the Editor recommends Consider after Major Changes, the recommendation is communicated to the authors. The authors are expected to revise their manuscripts in accordance with the changes recommended by the reviewers and to submit their revised manuscript in a timely manner. Once the revised manuscript is submitted, the Editor can then make an editorial recommendation which can be Publish Unaltered, Consider after Minor Changes, or Reject.

    If the Editor recommends rejecting the manuscript, the rejection is immediate. Also, if two of the reviewers recommend rejecting the manuscript, the rejection is immediate.

    The editorial workflow gives the Editors the authority to reject any manuscript because of inappropriateness of its subject, lack of quality, or incorrectness of its results. The Editor cannot assign himself/herself as an external reviewer of the manuscript. This is to ensure a high-quality, fair, and unbiased peer-review process of every manuscript submitted to the journal, since any manuscript must be recommended by one or more (usually two or more) external reviewers along with the Editor in charge of the manuscript in order for it to be accepted for publication in the journal.

  • The name of the Editor recommending the manuscript for publication is published with the manuscript to indicate and acknowledge their invaluable contribution to the peer-review process and the indispensability of their contributions to the running of the journals.

    The peer-review process is single blinded; that is, the reviewers know who the authors of the manuscript are, but the authors do not have access to the information of who the peer reviewers are. Every journal published by Hindawi has an acknowledgment page for the researchers who have performed the peer-review process for one or more of the journal manuscripts in the past year. Without the significant contributions made by these researchers, the publication of the journal would not be possible.

  • Advances in

    Bioinformatics ISSN: 1687-8027 Volume 2014 No. 1, June 2014

    Table of Contents

    Comparing Imputation Procedures for Affymetrix Gene Expression Datasets 01-10 Using MAQC DatasetsSreevidya Sadananda Sadasiva Rao, Lori A. Shepherd, Andrew E. Bruno, Song Liu,and Jeffrey C. Miecznikowski

    A Multilevel Gamma-Clustering Layout Algorithm for Visualization of Biological 11-20 NetworksTomas Hruz, Markus Wyss, Christoph Lucas, Oliver Laule, Peter von Rohr,

    Philip Zimmermann and Stefan Bleuler

    Reverse Engineering Sparse Gene Regulatory Networks Using Cubature Kalman 21-31Filter and Compressed SensingAmina Noor, Erchin Serpedin, Mohamed Nounou, and Hazem Nounou

    Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST 32-37DatabasesManrique Mata-Montero, Nabil Shalaby, and Bradley Sheppard

    Gene Regulation, Modulation, and Their Applications in Gene Expression 38-48 Data AnalysisMario Flores, Tzu-Hung Hsiao, Yu-Chiao Chiu, Eric Y. Chuang, Yufei Huangand Yidong Chen

    Spectral Analysis on Time-Course Expression Data: Detecting Periodic 49-58 Genes Using a Real-Valued Iterative Adaptive ApproachKwadwo S. Agyepong, Fang-Han Hsu, Edward R. Dougherty, and Erchin Serpedin

    Identification of Robust Pathway Markers for Cancer through Rank-Based 59-66Pathway Activity InferenceNavadon Khunlertgit and Byung-Jun Yoon

    An Overview of the Statistical Methods Used for Inferring Gene Regulatory 67-78 Networks and Protein-Protein Interaction NetworksAmina Noor, Erchin Serpedin, Mohamed Nounou, Hazem Nounou, Nady Mohamed

    and Lotfi Chouchane

    Using Protein Clusters from Whole Proteomes to Construct and Augment 79-86 a DendrogramYunyun Zhou, Douglas R. Call, and Shira L. Broschat

    Solving the 0/1 Knapsack Problem by a Biomolecular DNA Computer 87-92Hassan Taghipour, Mahdi Rezaei, and Heydar Ali Esmaili

  • Research Article

    Comparing Imputation Procedures for Affymetrix Gene

    Expression Datasets Using MAQC Datasets

    Sreevidya Sadananda Sadasiva Rao,1Lori A. Shepherd,

    1Andrew E. Bruno,

    2

    Song Liu,1and Jeffrey C. Miecznikowski

    1,3

    1 Department of Biostatistics, Roswell Park Cancer Institute, Bufalo, NY 14263, USA2 Center for Computational Research, University at Bufalo, NYS Center of Excellence in Bioinformatics and Life Sciences,Bufalo, NY 14203, USA

    3Department of Biostatistics, SUNY University at Bufalo, Bufalo, NY 14214, USA

    Correspondence should be addressed to Jefrey C. Miecznikowski; [email protected]

    Received 26 June 2013; Accepted 28 August 2013

    Academic Editor: Shandar Ahmad

    Copyright 2013 Sreevidya Sadananda Sadasiva Rao et al. Tis is an open access article distributed under the Creative CommonsAttribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work isproperly cited.

    Introduction. Te microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of theprecision, comparability of microarrays, and other various microarray analysis methods. However, to date no studies that we areaware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use theMAQC Afymetrix datasets to evaluate several imputation procedures in Afymetrix microarrays. Results. We evaluated severalcutting edge imputation procedures and compared them using diferent error measures. We randomly deleted 5% and 10% of thedata and imputed the missing values using imputation tests. We performed 1000 simulations and averaged the results. Te resultsfor both 5% and 10% deletion are similar. Among the imputation methods, we observe the local least squares method with = 4 ismost accurate under the error measures considered. Te k-nearest neighbor method with = 1 has the highest error rate amongimputation methods and error measures. Conclusions.We conclude for imputing missing values in Afymetrix microarray datasets,using the MAS 5.0 preprocessing scheme, the local least squares method with = 4 has the best overall performance and k-nearestneighbor method with = 1 has the worst overall performance. Tese results hold true for both 5% and 10% missing values.

    1. Introduction

    In microarray experiments, randomly missing values mayoccur due to scratches on the chip, spotting errors, dust, orhybridization errors. Other nonrandom missing values maybe biological in nature, for example, probes with low intensityvalues or intensity values that may exceed a readable thresh-old. Tese missing values will create incomplete gene expres-sion matrices where the rows refer to genes and the columnsrefer to samples. Tese incomplete expression matrices willmake it difcult for researchers to perform downstreamanalyses such as diferential expression inference, clusteringor dimension reductionmethods (e.g., principal componentsanalysis), or multidimensional scaling. Hence, it is critical tounderstand the nature of the missing values and to choose anaccurate method to impute the missing values.

    Tere have been several methods put forth to imputemissing data in microarray experiments. In one of the frstpapers related to microarrays, Troyanskaya et al. [1] examineseveral methods of imputing missing data and ultimatelysuggest a -nearest neighbors approach. Researchers alsoexplored applying previously developed schemes formicroar-rays such as the nonlinear iterative partial least squares(NIPALS) as discussed by Wold [2]. A Bayesian approach formissing data in gene expression microarrays is provided byOba et al. [3]. Other approaches such as that of B et al. [4]suggest using least squares methods to estimate the missingvalues in microarray data, while Kim et al. [5] suggest usinga local least squares imputation. A Gaussian mixture methodfor imputing missing data is proposed by Ouyang et al. [6].

    While many of these approaches can be generally appliedto diferent types of gene expression arrays, we will focus

    Hindawi Publishing Corporation

    Advances in Bioinformatics

    Volume 2014, No. 1, June 2014doi:10.1155/2012/705435

  • 2 Advances in Bioinformatics

    on applying these methods to Afymetrix gene expressionarrays, one of the most popular arrays in scientifc research.Naturally, when proposing a new imputation scheme forexpression arrays, it is necessary to compare the new methodagainst existing methods. Several excellent papers have com-pared missing data procedures on high throughput dataplatforms such as in two-dimensional gel electrophoresis asin Miecznikowski et al.s works [7] or gene expression arrays[810]. Before studying missing data imputation schemes inAfymetrix gene expression arrays, it is reasonable to frstremove any existing missing values. In this way, we ensurethat any subsequent missing values have known true values.A detection call algorithm is used to flter and removemissingexpression values based on absent/present calls [11]. Subse-quently, a preprocessing scheme is then employed. Tere arenumerous tasks to perform in preprocessing Afymetrixarrays, including background adjustment, normalization,and summarization. A good overview of the methods avail-able for preprocessing is provided by Gentleman et al. [12].For our analysis, the detection call employs MAS 5.0 [13] toobtain expression values; thus, we also use the MAS 5.0 suiteof functions as our preprocessing method.

    For our analysis, we focus on the microarray quality con-trol (MAQC) datasets (Accession no. GSE5350), where thedatasets have been specifcally designed to address the pointsof strength and weakness of various microarray analysismethods.TeMAQC datasets were designed by the US Foodand Drug Administration to provide quality control (QC)tools to the microarray community to avoid procedural fail-ures. Te project aimed to develop guidelines for microarraydata analysis by providing the public with large referencedatasets along with readily accessible reference ribonucleicacid (RNA) samples. Another purpose of this project was toestablish QC metrics and thresholds for objectively assessingthe performance achievable by various microarray platforms.Tese datasets were designed to evaluate the advantages anddisadvantages of various data analysis methods.

    Te initial results from theMAQCproject were publishedin Shis work [14] and later in Chen et al.s work [15] andShi et al.s work [16]. Specifcally, the MAQC experimentaldesign for Afymetrix gene expression HG-U133 Plus 2.0GeneChip includes 6 diferent test sites, 4 pools per site, and5 replicates per site, for a total of 120 arrays (see Section 2).Tis rich dataset provides an ideal setting for evaluatingimputation methods on Afymetrix expression arrays. Whilethis dataset has beenmined to determine inter-intra platformreproducibility ofmeasurements, to our knowledge, none hasstudied imputation methods on this dataset.

    Te MAQC dataset hybridizes two RNA sample typesUniversal Human Reference RNA (UHRR) from Stratageneand a Human Brain Reference RNA (HBRR) from Ambion.Tese 2 reference samples and varyingmixtures of these sam-ples constitute the 4 diferent pools included in the MAQCdataset. By using various mixtures of UHRR and HBRR, thisdataset is designed to study technical variations present inthis technology. By technical variations, we are referring tothe variability between preparations and labeling of sample,variability between hybridization of the same sample to dif-ferent arrays, testing site variability, and variability between

    the signal on replicate features of the same array. Meanwhile,biological variability refers to variability between individualsin population and is independent of the microarray processitself. By theMAQCdataset being designed to study technicalvariation, we can examine the accuracy of the imputationprocedures without the confounding feature of biologicalvariability. Other than MAQC datasets, similar technicaldatasets have been used to evaluate diferent analysismethodsspecifc to Afymetrix microarrays, for example, methods foridentifying diferentially expressed genes [1719].

    In summary, our analysis examines cutting edge imputa-tion schemes on an Afymetrix technical dataset with min-imal biological variation. Section 2 discusses the MAQCdataset and the proposed imputation schemes. Meanwhile,Section 3 describes the results from applying the imputationmethods for addressing missingness in the MAQC datasets.Finally, we conclude our paper with a discussion and conclu-sion in Sections 4 and 5.

    2. Materials and Methods

    2.1. Datasets. Te MAQC experiments and datasets are fullydescribed by Shi [14]. Te MAQC dataset hybridizes 2 RNAsamples a Universal Human Reference RNA (UHRR) fromStratagene and a Human Brain Reference RNA (HBRR) fromAmbion. From these 2 samples, 4 pools are created, that is, the2 reference RNA samples as well as 2 mixtures of the originalsamples: Pool A, 100% UHRR; Pool B, 100% HBRR; Pool C,75%UHRR and 25%HBRR; and Pool D, 25%UHRR and 75%HBRR. Both Pool A and Pool B are commercially availableand biologically distinct where we expect a large number ofdiferentially expressed genes between Pool A and Pool B.

    Tere are 6 diferent test sites where each test site assayedthe 4 pools with 5 replicates per pool. Tus, for each test sitethere are a total of 20 arrays and thus a total of 120 arraysover the 6 sites. Te data is examined separately for eachpool (4) and each site (6) separately yielding 24 site and pooldatasets.

    2.2.Missing Values andDetection Call Algorithm. UsingMAS5.0, a detection call algorithm is used to fag the missingvalues [13]. Te detection call determines if the transcript ofa gene is present or absent in the sample. For every gene, themicroarray chip has probes that perfectly match a segmentof the gene sequence (PM probes) and probes that containa single mismatched nucleotide in the center of the perfectmatch probe (MM probes). Te diference in the intensity ofthe perfect and mismatch probes is used to make detectioncalls.

    Te detection call algorithm is further summarized byMei et al. [11]. For each genetic transcript, there is a probeset with 11 to 20 probe pairs where a probe pair consists ofa PM probe and MM probe. In short, discrimination scoresare calculated for each probe set from the raw intensity datafor the probe pairs in the probe set. For each probe pair, theratio of the sum and diference of the PM and MM probesgives the discrimination score for that probe pair. Tis scoreis calculated for all the probe pairs in a probe set. Te nullhypothesis is that the median discrimination score of a probe

  • Advances in Bioinformatics 3

    set is equal to , and the alternate hypothesis is that themedian discrimination score is greater than , where isdefned as a small nonnegative number which can be changedby the user to adjust the specifcity and sensitivity. One-sidedWilcoxon rank sum tests are performed for each probe set.Two signifcance levels 1 and 2, act as the cutofs for the values for probe set detection calls. A present call is made fora probe set (transcript) with a value

  • 4 Advances in Bioinformatics

    TeLLSmethod is a neighbor-based approach that selectsneighbors based on their Pearson correlation coefcient.Multiple regression is performed using -nearest neighborsas described by Kim et al. [5], and the LLS method is imple-mented using the R package pcaMethods [26]. Te methodrestricts to be less than the number of replicates/columns. Inour case, with 5 replicates, we chose equal to 1, 3, or 4. Globalbased methods, SVD [1] and BPCA [3], were implementedusing the R package pcaMethods [26]. Te NIPALS methodis summarized by Wold [2] and is implemented using theR package pcaMethods [26]. Similar to KNN, in order toimplement the NIPALS algorithm, it is necessary for the userto specify the number of principal components. To evaluatethe diferent methods of imputation, probe set expressionvalues were randomly deleted from the complete dataset, andthe summary measures in the next section were comparedacross the methods.

    2.6. Quantitative Error Evaluation. Te complete expressionmatrices for each pool and site are such that the rows corre-spond to probe sets, and the columns correspond to samples.Similar to Oh et al. [10], we denote this complete expressionmatrix as CD = (), where is the expression intensityof probe set (roughly speaking gene) on sample . Tosimulate the missing data, we randomly remove 5% or 10%of the entries in CD. Ten given a missing value imputationscheme, themissing value for probe set, sample , is imputedas and the imputed dataset is denoted as ID.

    To compare the imputed dataset ID with the completedataset CD, we employ the following summary statistics:

    (1) root mean squared error (RMSE),

    RMSE = 1no. of missing { missing}( )2, (1)

    (2) relative estimation error (RAE) [25],

    RAE = 1no. of missing { missing}

    () , (2)where

    () = {{{ , if > ,, if < , (3)

    (3) logged RMSE (LRMSE) [8],

    LRMSE = 1no. of missing { missing}( )2, (4)

    where = log (), and(4) RAE-L2 [10],

    RAE-L2 = 1no. of missing { missing}

    ( )2 . (5)

    Pool

    Pre

    sen

    t (%

    )

    51.0

    51.5

    52.0

    52.5

    53.0

    53.5

    54.0

    54.5

    55.0

    55.5

    56.0

    56.5

    57.0

    57.5

    58.0

    58.5

    59.0

    A B C D

    Site 1

    Site 2

    Site 3

    Site 4

    Site 5

    Site 6

    Average of percent present calls per site per pool

    Figure 1: Percent present across pools and sites. Each curve shows adiferent site, and the -axis shows the 4 pools and the -axis showsthe mean percentage of present probes on the Afymetrix arrays.Pool B has the smallest percentage of present probes, while Pool Dhas the largest percentage of present probes. Site 4 has the highestpercentage of present probes, while Site 2 has the lowest percentageof present probes.

    See Section 4 for the motivation for using these error mea-sures to evaluate the imputation methods. To understand thevariability in the imputation procedures, we perform eachmissing data simulation 1000 times.

    2.7. Ranking the Imputation Methods. To identify the overallbest and worst performing imputation methods (IM), werank the IM based on their average performance across thediferent error measures, all pools, and all sites. Te rankingprocedure is carried out separately for 5% and 10% deletion.

    For each simulation, we compute 4 error measures foreach of the 10 imputation methods. Averaging over the 1000simulations, we get an average error value for each imputationmethod for every site and pool combination. For example, forthe metric RMSE, there are 10 values: 1 for each imputationmethod at, say, Site 1 and Pool A.

    Ten, we rank the 10 IM based on each error measureseparately for each site and pool combination. For example,based on RMSE values, the IM are ranked from the lowest tohighest; the IM with lowest RMSE value is 1 and the IM withthe highest RMSE is 10. Te IM each have a rank value for agiven error measure at each site for each of the 4 pools.

    For every imputation method, the error measure rankvalues are averaged across the 6 sites for each pool; thus weobtain 4 average rank values, 1 for each pool. Finally, weaverage these 4 rank values to obtain a single number thatgives a global ranking to every imputation method, refecting

  • Advances in Bioinformatics 5

    Table 1: Summary of imputation methods for 5% and 10% deletion.

    Error metric RMSE LRMSE RAE RAEL2 Average

    Deletion % 5% 10% 5% 10% 5% 10% 5% 10% 5% 10%

    BPCA 2.38 2.33 5.5 5.71 5.13 5.46 5.63 5.67 4.66 4.79

    KNN1 9.79 9.79 9.88 10 9.83 10 9.79 9.88 9.82 9.92

    KNN5 7.83 7.83 9.08 8.88 8.33 8.75 8.42 8.79 8.42 8.56

    LLS1 5.17 5.21 4.29 4.25 4.67 4.5 4.38 4.33 4.63 4.57

    LLS3 3.83 3.75 2.17 2.25 2.13 2.17 2.17 2.17 2.57 2.58

    LLS4 2.79 2.92 2.29 2.92 1.96 1.96 2 2.08 2.26 2.48

    LSA 1 1 4.88 4.92 4.33 4.5 4.71 4.71 3.73 3.78

    NIPALS 7.25 7.25 7.33 6.96 7.33 7.08 7.33 7.04 7.31 7.08

    ROW 6 5.96 1.5 1.5 2 2 1.58 1.46 2.77 2.72

    SVD 8.96 8.96 8.29 8.08 8.29 8.17 8.33 8.29 8.47 8.38

    Rows correspond to imputation methods and columns correspond to error measures with the last columns showing the average across the error measures.Each imputation method is ranked based on its average rank performance across all pools and all sites.Te rank values for every error measure and imputationmethod combination are averaged across the 6 sites and 4 pools as detailed in Section2. Smaller average rank values suggest more accurate imputationmethods.From the table, we observe that RMSE metric suggests that LSA imputation method has the best performance. With LRMSE and RAEL2 metrics, ROW is thebest imputation method. LLS with = 4 (LLS4) has the best performance when we use the RAE error measure. KNN with = 1 (KNNl) has the highest rankvalue for any given error measure; thus, it is the worst performing imputation method. LLS with = 4 (LLS4) has the overall best performance across thediferent error measures. Tese results hold true for both 5% and 10% deletion.

    0

    200

    400

    600

    800

    1000

    1200

    1400

    RM

    SE m

    ean

    val

    ues

    1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M

    Test sites (16) and imputation methods

    BPCA KNN1 KNN5 LLS1 LLS3 LLS4 LSA NIPALS ROW SVD

    Figure 2: Average RMSE barplot with error bars. RMSE values are represented on the -axis.Te -axis has the 6 sites (1, 2, 3, 4, 5, and 6) and10 imputation tests (BPCA, KNN with = 1, 5, LLS with = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashed bar isthe overall mean for individual IM where the RMSE values are averaged across the 4 pools and 6 sites. Tis fgure shows the performance ofthe 10 imputation tests using the RMSEmetric with 5% deletion of values. 1000 simulations were performed where each simulation generateda dataset containing 5% missing values by randomly removing probe set values from the complete expression matrix of probe sets. Missingvalues were imputed using the 10 imputation tests. Te results are compared using the RMSE metric (see Section 2). Te RMSE values areaveraged across the 4 pools. LSA has the best performance as it has the lowest RMSE value for a given site. KNN with = 1 has the highestRMSE value and has the worst performance for all pools and all sites.

  • 6 Advances in Bioinformatics

    0.0

    0.2

    0.4

    0.6L

    RM

    SE m

    ean

    val

    ues

    1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M

    Test sites (16) and imputation methods

    BPCA KNN1 KNN5 LLS1 LLS3 LLS4 LSA NIPALS ROW SVD

    Figure 3: Average LRMSE barplot with error bars. LRMSE values are represented on the -axis. Te -axis has the 6 sites (1, 2, 3, 4, 5, and 6)and 10 imputation tests (BPCA, KNN with = 1, 5, LLS with = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashedbar represents the overall mean for individual IM where the LRMSE values are averaged across the 4 pools and 6 sites. Tis fgure showsthe performance of the 10 imputation tests using the RMSE metric with 5% deletion of values. 1000 simulations were performed where eachsimulation generated a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrixof probe sets. Missing values were imputed using the 10 imputation tests. Te results are compared using the LRMSE metric (see Section 2).Te LRMSE values are averaged across the 4 pools. ROW has the best performance as it has the lowest LRMSE value for a given site. KNNwith = 1 has the highest LRMSE value and has the worst performance for all pools and all sites.its overall performance across diferent error measures, sites,and pools for a given deletion percentage.

    3. Results

    We summarize our fndings in two ways: probe set detectioncall summaries and errormetrics and rankings for IM.Detec-tion call results compare sites and pools while IM resultschoose the best imputation method based on the errormetrics discussed in Section 2.

    3.1. Detection Call Algorithm Results. Across the 120 samples,as shown in Figure 1 the percent present calls has a minimumvalue of 51% and a maximum value of 58.5%. We observethat Site 4 have the highest mean percent present calls andSite 2 has the lowest mean percent present calls for probesets. In terms of pools, Pool B has the lowest mean percentpresent calls for probe sets while Pool D has the highest meanpercent present calls (see Figure 1).We performed an analysisof variance (ANOVA) to examine the efects of site and poolon the percentage of present probe sets in amicroarray.Te values for site and pool are

  • Advances in Bioinformatics 7

    0.0

    0.2

    0.4

    0.6

    0.8

    1.0

    RA

    E m

    ean

    val

    ues

    1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M

    Test sites (16) and imputation methods

    BPCA KNN1 KNN5 LLS1 LLS3 LLS4 LSA NIPALS ROW SVD

    Figure 4: Average RAE barplot with error bars. RAE values are represented on the -axis. Te -axis has the 6 sites (1, 2, 3, 4, 5, and 6)and 10 imputation tests (BPCA, KNN with = 1, 5, lls with = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashedbar represents the overall mean for individual IM where the RAE values are averaged across the 4 pools and 6 sites. Tis fgure shows theperformance of the 10 imputation tests using the RAE metric with 5% deletion of values. 1000 simulations were performed where eachsimulation generated a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrix ofprobe sets. Missing values were imputed using the 10 imputation tests. Te results are compared using the RAE metric (see Section 2). TeRAE values are averaged across the 4 pools. LLS with = 4 has the best performance as it has the lowest RAE value for a given site. KNNwith = 1 has the highest RAE value and has the worst performance for all pools and all sites.

    performing imputation method across all pools and sites.LLS with = 4 has the overall best performance across thediferent error measures.

    Figures 2, 3, 4, and 5 show the performance of diferentimputation methods for each error measure for all the pooland site combinations for 5% deletion. Further supplementalfgures and tables showing the performance of diferentimputationmethods on each site and pool asmeasured by the4 error measures are found in Sadasiva Rao et al.s work [27].Results from 5% deletion and 10% deletion show a similarpattern. As expected the imputed values and variance with10% missing data are larger than 5% missing data. Site 4 hasthe highest values for most of the imputation tests for all thesamples (see Sadasiva Rao et al.s work [27] for more details).Ultimately, LLS with = 4 has the best performance with 10%deleted values.

    4. Discussion

    Te MAQC project allows researchers to study a variety ofmicroarray aspects including comparisons of one-color andtwo-color arrays [28], reproducibility [14, 15, 29], removal ofbatch efects [30], and determining diferentially expressed

    genes [31]. From this diverse research, it is clear thatthe MAQC projects represent a fertile testing ground formicroarray inspired algorithms and methods. However, todate, we are not aware of any work examining imputationmethods on the MAQC datasets.

    Our conclusion is that LLS with = 4 has the bestperformance given our set of error measures. We note thatthe optimality of LLS with = 4 is not uniform acrossall error measures, sites, and pools. Also, in Figures 25, itis clear that several other imputation methods ofer similarperformance to LLS with = 4, for example, LLS with =1, 3, LSA, and BPCA.Tese results are similar to those foundby Brock et al. [8] and commented on by Aittokallio [32]concluding that the top performing imputation algorithms(LS, LLS, and BPCA) are all highly competitive with eachother, but no method is uniformly superior in all analyses. Tothat end, Brock et al. [8] develop measures to determine theappropriate (optimal) imputation method for a given datasetbased on the correlation within the dataset.

    We choose a set of cutting edge imputation schemes toapply in the MAQC datasets. Tere are numerous appliedreferences for the imputation schemes including [710, 3234]. Optimality in the imputation schemes was assessed

  • 8 Advances in Bioinformatics

    BPCA KNN1 KNN5 LLS1 LLS3 LLS4 LSA NIPALS ROW SVD

    0.0

    0.5

    1.0

    1.5

    2.0

    2.5

    RA

    EL

    2 m

    ean

    val

    ues

    1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M 1 2 3 4 5 6M

    Test sites (16) and imputation methods

    Figure 5: Average RAEL2 barplot with error bars. RAEL2 values are represented on the -axis. Te -axis has the 6 sites (1, 2, 3, 4, 5, and 6)and 10 imputation tests (BPCA, KNN with = 1, 5, LLS with = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashedbar represents the overall mean for individual IM where the RAEL2 values are averaged across the 4 pools and 6 sites. Tis fgure shows theperformance of the 10 imputation tests using the RAEL2 metric with 5% deletion of values. 1000 simulations were performed where eachsimulation generated a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrixof probe sets. Missing values were imputed using the 10 imputation tests. Te results are compared using the RAEL2 error measure (seeSection 2). Te RAEL2 values are averaged across the 4 pools. ROW has the best performance as it has the lowest RAEL2 value for a givensite. KNN with = 1 has the highest RAEL2 value and has the worst performance for all pools and all sites.via (1) raw score error measures and (2) rank-based errormeasures taken across our cohort of error measures. Teerror measures chosen (see Secton 2) were designed to assess(1) errors in raw expression values (RMSE), (2) errors inthe logarithm transformed expression values (LRMSE), (3)relative errors designed to penalize errors relative to the rawexpression values (RAE), and (4) relative errors designedto penalize the error relative to the logarithm expressionvalue (RAEL2). Hence, there are 2 (relative and absolute)error measures based on raw expression scores and 2 errormeasures (relative and absolute) based on the logarithm ofexpression values. Because of this balanced design in errormeasures between relative and absolute measures and rawand logarithm transformed data, it is reasonable to computethe average rank across these error measures to assess theoverall quality of an imputation method (see Table 1). Tus,these rank-based errormeasures shown in Table 1 summarizethe results in a straightforward manner across sites, pools,and error measures. Note that we set = 0.20 for the RAEerror method. For future work, our group is interested instudying the robustness of RAE to the choice of . We alsoinclude the raw score error measures to demonstrate the bestimputation methods regardless of the employed set of theimputation methods (see Figures 25).

    Our study is designed with the technical MAQC datasetin mind. Tus, our error measures do not include biologicalmeasures of the type discussed in [10]. Tese biological mea-sures are designed to study the clustering and classifcationschemes commonly applied to gene expression microarrays.While our summary error measures are important to com-pare the imputation schemes, it is not clear how the diferentimputation procedures will afect downstream biologicalanalysis and interpretation. It is outside of the scope of thispaper to address this biological question since the MAQCexperiment does not represent a real biological experiment.

    We study imputation methods while using the MAS 5.0algorithm as the preprocessing method. However, there areother preprocessing algorithms such as RMA [3537] andGCRMA [38] that are routinely used, and these methodsmay infuence the performance of the imputation scheme.Wehighlight several works that extensively study and comparepreprocessing schemes for Afymetrix datasets including [17,18, 39, 40]. It is of future work to compare imputationmethods across diferent preprocessing algorithms.

    We recognize that the MAQC datasets are not withoutcriticism. For example, the issue of choosing an overall opti-mal preprocessing scheme is still an open question [41].Another serious criticism is provided in [42] with a reply by

  • Advances in Bioinformatics 9

    Shi et al. [43]. In that discussion, one of the main concernsinvolves technical versus biological variation.Tis importantissue has arisen when studying other technical microarraydatasets [39]. Considering both aspects of this question, ifwe use datasets containing biological and technical variation,that is, datasets designed to answer biological questions, thenthere are biases due to the intent of the original datasets(e.g., biological variation of the species, sample preparation,procurement of RNA, and hybridization afnities).

    5. Conclusions

    Missing values in microarray experiments are a commonproblemwith efects on downstream analysis. Many variablessuch as the biological variability of the dataset, experimentalconditions of the study, percentage of missing values, andtype of downstream analysis performed need to be consid-ered when choosing an imputation method.

    In our work, we use theMAQCdatasets with theMAS 5.0preprocessing scheme to compare missing data imputationschemes for Afymetrix datasets.Te best andworst perform-ing imputation schemes remain the same for both 5% and 10%deletion percentages. We observe that -nearest neighbormethod with = 1 has the worst performance among theimputation schemes across all error measures. Local leastsquares (LLS) method with = 4 gives the best performancefor imputing missing values across all error measures forboth 5% and 10% deletion. Tese conclusions are based onstudying 10 imputation methods with 4 error metrics and1000 Monte-Carlo simulations.

    Authors Contribution

    Jefrey C. Miecznikowski and Song Liu designed the study.Sreevidya Sadananda Sadasiva Rao performed the statisticalanalysis. Sreevidya Sadananda Sadasiva Rao and Jefrey C.Miecznikowski wrote the paper. Lori A. Shepherd andAndrew E. Bruno assisted with the data analysis and writingthe paper. All authors read and approved the fnal paper.

    References

    [1] O. Troyanskaya, M. Cantor, G. Sherlock et al., Missing valueestimation methods for DNA microarrays, Bioinformatics, vol.17, no. 6, pp. 520525, 2001.

    [2] H. Wold, Path models with latent variables: the NIPALSapproach, in Quantitative Sociology: International Perspectiveson Mathematical and Statistical Modeling, pp. 307357, 1975.

    [3] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, andS. Ishii, A Bayesian missing value estimation method for geneexpression profle data, Bioinformatics, vol. 19, no. 16, pp. 20882096, 2003.

    [4] T. H. B, B. Dysvik, and I. Jonassen, LSimpute: accurate esti-mation of missing values in microarray data with least squaresmethods, Nucleic Acids Research, vol. 32, no. 3, p. e34, 2004.

    [5] H. Kim, G. H. Golub, and H. Park, Missing value estimationfor DNA microarray gene expression data: local least squaresimputation, Bioinformatics, vol. 21, no. 2, pp. 187198, 2005.

    [6] M. Ouyang, W. J. Welsh, and P. Georgopoulos, Gaussian mix-ture clustering and imputation of microarray data, Bioinfor-matics, vol. 20, no. 6, pp. 917923, 2004.

    [7] J. C. Miecznikowski, S. Damodaran, K. F. Sellers, D. E. Coling,R. Salvi, and R. A. Rabin, A comparison of imputation proce-dures and statistical tests for the analysis of two-dimensionalelectrophoresis data, Proteome Science, vol. 9, p. 14, 2011.

    [8] G. N. Brock, J. R. Shafer, R. E. Blakesley, M. J. Lotz, and G. C.Tseng, Which missing value imputation method to use inexpression profles: a comparative study and two selectionschemes, BMC Bioinformatics, vol. 9, no. 1, p. 12, 2008.

    [9] M. Celton, A. Malpertuy, G. Lelandais, and A. G. de Brevern,Comparative analysis of missing value imputation methodsto improve clustering and interpretation of microarray exper-iments, BMC Genomics, vol. 11, no. 1, p. 15, 2010.

    [10] S. Oh, D. D. Kang, G. N. Brock, and G. C. Tseng, Biologicalimpact of missing-value imputation on downstream analyses ofgene expression profles, Bioinformatics, vol. 27, no. 1, ArticleID btq613, pp. 7886, 2011.

    [11] R. Mei, X. Di, T. B. Ryder et al., Analysis of high density expres-sion microarrays with signed-rank call algorithms, Bioinfor-matics, vol. 18, no. 12, pp. 15931599, 2002.

    [12] R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit,Bioinformatics and computational biology solutions using Rand Bioconductor, Statistics for Biology and Health, 2005.

    [13] L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry, Afy-Analysis of Afymetrix GeneChip data at the probe level,Bioinformatics, vol. 20, no. 3, pp. 307315, 2004.

    [14] L. Shi, Te MicroArray Quality Control (MAQC) projectshows inter- and intraplatform reproducibility of gene expres-sion measurements, Nature Biotechnology, vol. 24, no. 9, pp.11511161, 2006.

    [15] J. J. Chen, H. Hsueh, R. R. Delongchamp, C. Lin, and C.Tsai, Reproducibility of microarray data: a further analysis ofmicroarray quality control (MAQC) data, BMCBioinformatics,vol. 8, no. 1, p. 412, 2007.

    [16] L. Shi, W. D. Jones, R. V. Jensen et al., Te balance of repro-ducibility, sensitivity, and specifcity of lists of diferentiallyexpressed genes in microarray studies, BMC Bioinformatics,vol. 9, supplement 9, p. S10, 2008.

    [17] S. E. Choe, M. Boutros, A. M. Michelson, G. M. Church, andM. S. Halfon, Preferred analysis methods for Afymetrix Gene-Chips revealed by a wholly defned control dataset, GenomeBiology, vol. 6, no. 2, p. R16, 2005.

    [18] Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, Preferred anal-ysis methods for Afymetrix GeneChips. II. An expanded, bal-anced, wholly-defned spike-in dataset, BMC Bioinformatics,vol. 11, no. 1, p. 285, 2010.

    [19] Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, A whollydefned Agilent microarray spike-in dataset, Bioinformatics,vol. 27, no. 9, Article ID btr135, pp. 12841289, 2011.

    [20] I. Afymetrix, Statistical algorithms description document,Technical Paper, 2002.

    [21] C. L.Wilson andC. J.Miller, Simpleafy: a BioConductor pack-age for Afymetrix Quality Control and data analysis, Bioinfor-matics, vol. 21, no. 18, pp. 36833685, 2005.

    [22] R. C. Gentleman, V. J. Carey, D. M. Bates et al., Bioconductor:open sofware development for computational biology andbioinformatics, Genome Biology, vol. 5, no. 10, p. R80, 2004.

    [23] T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu, Impute:Imputation for Microarray Data, 1999, R package version 1.10.0.

  • 10 Advances in Bioinformatics

    [24] T.H. BB, B. Dysvik, and I. Jonassen, Lsimpute: Accurate esti-mation of missing values in microarray data with least squaresmethods, 2005, http://www.ii.uib.no/trondb/imputation/.

    [25] D. V. Nguyen, N.Wang, and R. J. Carroll, Evaluation ofmissingvalue estimation for microarray data, Journal of Data Science,vol. 2, no. 4, pp. 347370, 2004.

    [26] W. Stacklies and H. Redestig, PcaMethods: A Collection of PCAMethods, 2007, R package version 1.18.0.

    [27] S. S. Sadasiva Rao, L. A. Shepherd, A. E. Bruno, S. Liu, and J.C. Miecznikowski, A full analysis of imputation procedures forAfymetrix gene expression datasets, Technical Report 1202,SUNY University at Bufalo-Department of Biostatistics, Buf-falo, NY, USA, 2012.

    [28] T. A. Patterson, E. K. Lobenhofer, S. B. Fulmer-Smentek et al.,Performance comparison of one-color and two-color plat-forms within the MicroArray Quality Control (MAQC)project, Nature Biotechnology, vol. 24, no. 9, pp. 11401150,2006.

    [29] Z. Wen, C. Wang, Q. Shi et al., Evaluation of gene expressiondata generated from expired Afymetrix GeneChip microar-rays using MAQC reference RNA samples, BMC Bioinformat-ics, vol. 11, supplement 6, p. S10, 2010.

    [30] J. Luo, M. Schumacher, A. Scherer et al., A comparison of batchefect removal methods for enhancement of prediction per-formance using MAQC-II microarray gene expression data,Pharmacogenomics Journal, vol. 10, no. 4, pp. 278291, 2010.

    [31] K. Kadota and K. Shimizu, Evaluating methods for rankingdiferentially expressed genes applied to microArray qualitycontrol data, BMC Bioinformatics, vol. 12, no. 1, p. 227, 2011.

    [32] T. Aittokallio, Dealing with missing values in large-scale stud-ies: microarray data imputation and beyond, Briefngs in Bioin-formatics, vol. 11, no. 2, Article ID bbp059, pp. 253264, 2009.

    [33] J. Tuikkala, L. L. Elo, O. S. Nevalainen, and T. Aittokallio, Miss-ing value imputation improves clustering and interpretation ofgene expression microarray data, BMC Bioinformatics, vol. 9,no. 1, p. 202, 2008.

    [34] A. Liew, N. Law, andH. Yan, Missing value imputation for geneexpression data: computational techniques to recover missingdata from available information, Briefngs in Bioinformatics,vol. 12, no. 5, Article ID bbq080, pp. 498513, 2011.

    [35] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed, Acomparison of normalizationmethods for high density oligonu-cleotide array data based on variance and bias, Bioinformatics,vol. 19, no. 2, pp. 185193, 2003.

    [36] R. A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B.Hobbs, andT. P. Speed, Summaries of Afymetrix GeneChip probe leveldata, Nucleic Acids Research, vol. 31, no. 4, p. e15, 2003.

    [37] R. A. Irizarry, B.Hobbs, F. Collin et al., Exploration, normaliza-tion, and summaries of high density oligonucleotide array probelevel data, Biostatistics, vol. 4, no. 2, pp. 249264, 2003.

    [38] Z.Wu, R. A. Irizarry, R. Gentleman, F. Martinez-Murillo, and F.Spencer, A model-based background adjustment for oligonu-cleotide expression arrays, Journal of the American StatisticalAssociation, vol. 99, no. 468, pp. 909917, 2004.

    [39] A. R. Dabney and J. D. Storey, A reanalysis of a publishedAfymetrix GeneChip control dataset, Genome Biology, vol. 7,no. 3, p. 401, 2006.

    [40] D. P. Gaile and J. C. Miecznikowski, Putative null distributionscorresponding to tests of diferential expression in the GoldenSpike dataset are intensity dependent, BMC Genomics, vol. 8,no. 1, p. 105, 2007.

    [41] J. M. Perkel, Six things you wont fnd in the MAQC, TeScientist, vol. 20, no. 11, p. 68, 2007.

    [42] P. Liang, MAQC papers over the cracks,Nature Biotechnology,vol. 25, no. 1, pp. 2728, 2007.

    [43] L. Shi, W. D. Jones, R. V. Jensen et al., Reply to MAQC papersover the cracks, Nature Biotechnology, vol. 25, pp. 2829, 2007.

  • Research Article

    A Multilevel Gamma-Clustering Layout Algorithm for

    Visualization of Biological Networks

    Tomas Hruz,1

    Markus Wyss,2

    Christoph Lucas,1

    Oliver Laule,2

    Peter von Rohr,2

    Philip Zimmermann,2

    and Stefan Bleuler2

    1 Institute of Teoretical Computer Science, ETH Zurich, 8092 Zurich, Switzerland2NEBION AG, Hohlstrae 515, 8048 Zurich, Switzerland

    Correspondence should be addressed to Markus Wyss; [email protected]

    Received 12 April 2013; Accepted 7 June 2013

    Academic Editor: Guohui Lin

    Copyright 2013 Tomas Hruz et al. Tis is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    Visualization of large complex networks has become an indispensable part of systems biology, where organisms need to beconsidered as one complex system. Te visualization of the corresponding network is challenging due to the size and density ofedges. Inmany cases, the use of standard visualization algorithms can lead to high running times and poorly readable visualizationsdue to many edge crossings. We suggest an approach that analyzes the structure of the graph frst and then generates a new graphwhich contains specifc semantic symbols for regular substructures like dense clusters. We propose a multilevel gamma-clusteringlayout visualization algorithm (MLGA) which proceeds in three subsequent steps: (i) a multilevel -clustering is used to identifythe structure of the underlying network, (ii) the network is transformed to a tree, and (iii) fnally, the resulting tree which showsthe network structure is drawn using a variation of a force-directed algorithm. Te algorithm has a potential to visualize verylarge networks because it uses modern clustering heuristics which are optimized for large graphs. Moreover, most of the edges areremoved from the visual representation which allows keeping the overview over complex graphs with dense subgraphs.

    1. Introduction

    Te development in systems biology has brought a stronginterest in considering an organism as a large and complexnetwork of interacting parts. Many subsystems of livingorganisms can be modeled as complex networks. One impor-tant example is a network of biochemical reactions whichconstitutes a complex system responsible for homeostasis inthe living cell. An abstract network model of the biochemicalprocesses within the cell can be constructed such that reac-tions are represented as nodes andmetabolites (and enzymes)as edges. In the past, this system was studied mainly on asubsystem level through metabolic pathways. Recently, it hasbecome important to consider the metabolic system as onecomplex network to understand deeper phenomena involv-ing interactions across multiple pathways.

    Te need to study the whole network consisting of thou-sands of reactions, metabolites, and enzymes requires avisualization system allowing biologists to study the overall

    structure of the system. Such a visualization should allownav-igation and comprehension of the global system structures.In the present paper, we propose a visualization algorithmfor very large networks arising in systems biology and weillustrate its usage on two complex biological networks. Tefrst case study is a metabolic network ofArabidopsis thalianaand the second case study is a gene correlation network ofMus musculus based on mRNA expression measurements.

    Biological networks are usually represented as graphsbecause such model can provide an insight into their struc-ture.Te goal of the subsequent visualization is to present theinformation contained in the graph in a clear and structuredway. For instance, closely related nodes of a subsystem shouldbe positioned together. Tis can be achieved using a costfunction which formalizes the visualization criteria andwhich controls the drawing algorithm. Several standardalgorithms exist to achieve this goal using continuous opti-mization of the cost function, but the optimization of adiscrete cost function remains hard to solve.

    Hindawi Publishing Corporation

    Advances in Bioinformatics

    Volume 2014, No. 1, June 2014doi:10.1155/2012/705435

  • 12 Advances in Bioinformatics

    (a) (b)

    Figure 1: Arabidopsis thalianametabolic network visualized with (a) a force-directed algorithm with all edges shown, (b) the MLGAmethodwhich combines -clustering with the force-directed algorithm. Te underlying network has 1199 reactions (nodes) and 4386 metabolites(edges).

    A widely used graph drawing method for larger graphsis the force-directed layout algorithm [1, 2]. Basically, thegraph is modeled as a physical system. A force is calculatedon every node: a repulsive force between every pair of nodesand an attractive force if an edge exists between two nodes.Te forces direct the system into a steady state which defnesa fnal layout. However, themethod has several disadvantagesfor large graphs with many edges. First, a straightforwardimplementation needs to calculate the forces between eachnode-pair in each iteration. Second, for complex graphs toomany iterations are needed to fnd an optimal layout. Tird,a drawback results from the node degree distribution in bio-logical networks which tends to be skewed (scale-free). Fewnodes have a high degree while a large number of nodes havea small degree. Te attractive forces will stick together thenodes with many interactions in a small area which preventsthe identifcation of the network structure in the dense partsof the network; see Figure 1(a). Te repulsive force againstthe other nodes leads to a scattered layout. To overcomethese disadvantages, other graph visualization methods haveemerged which are discussed later. On the other hand, for avery specifc class of graphs like trees, a modifed version offorce-directed algorithm can be still a suitable method.

    Te visualization of very large biological networks wasconsidered in [3]. Te large graph layout algorithm (LGL)separates the graph into connected components, lays outeach connected component separately, and integrates theselayouts into one coordinate system. A grid variant of thespring-based algorithm [1] is used to draw the graph foreach connected component. To separate dense parts of eachcomponent, the minimum spanning tree (MST) is calculatedto defne the order in which nodes are included in thelayout computation. Beginning from a root node of the MST,new nodes with increasing edge distance from the root are

    iteratively added to the layout. Te new nodes are placedrandomly on spheres away from the current layout. Ateach iteration, the spring-based layout algorithm is executeduntil the layout is at rest. Under certain conditions, thisnode placement strategy reduces cluttering and retains thestructure of core components; moreover, it separates highlyconnected components. Tis layout phase is illustrated in [3,page 181, Figure 1]. However, in some situations, the LGLalgorithm can even obfuscate the true structure of the graph.Consider the situation where in the graph two cliques areconnected by a matching. Te MST algorithm will representthis subgraph as a star having many paths of length twofrom the center and one path of length three leading to thecenter of the second clique. Te rendering according to LGLwould lead to a situation where the frst clique is placed in theinterior of the second clique. Such starting confguration caneasily lead to a situation where the force-directed algorithmcannot separate the cliques;moreover, the edges of the secondclique would cross the rest of the graph.Te problem is that inthis situation theMST algorithm reduces the second clique toone edge. In such cases a diferent solution would be neededas we describe later.

    Te problem of fast visualization for protein interactionnetworks was studied in [4]. Te method uses an approachwith a grouping phase, and a layout phase. In the groupingphase the algorithm identifes the connected components ofthe graph and uniformly selects pivot nodes in each compo-nent. Te selection of the pivot nodes is controlled by a set ofrules based on empirical parameters. In the layout phase, thepivot nodes defne an initial layout of the connected compo-nents. Aferwards, the layout of each connected component isrefned separately.Te authors show that the method is fasterthan many other algorithms; however, a certain disadvantageof this algorithm is the choice of pivot nodes involving

  • Advances in Bioinformatics

    many parameters and a complex set of rules. Te rules andits parameters are heuristically identifed to give a uniformdistribution of the nodes within the connected component.Another drawback is that the method per se cannot visualize,the structure of dense subgraphs because of too many edgecrossings (see [4, page 1887, Figure 3]). To improve the visu-alization the authors introduce visual operations to collapsethe cliques (and complete bipartite subgraphs) to reduce thenumber of edges and nodes. Additionally, the problem offnding maximal clique (or complete bipartite subgraph) isNP hard together with its approximation there is almostno chance to have fast identifcation heuristics for largegraphs. Our algorithm improves the situation in this respectbecause relaxing requested density of the subgraph through-clustering (where 0 1 is the cluster density) allowsmuch more efcient heuristics for large graphs (order 106nodes and edges [5]).

    A global optimization method was explored in [6] wherethe authors describe a layout algorithm for metabolic net-works. Nodes of the graph are placed on a square grid. Adiscrete cost function between a pair of nodes is introducedbased on their relation and position on the grid. By min-imizing the total cost, a layout is generated. A simulatedannealing heuristic is used to optimize the cost functionby choosing better layouts among possible candidates. Dueto the computationally costly calculation of the layout, theapproach is applicable to networks with a few hundred nodesonly. Te authors showed that the algorithm works well onsparse or planar graphs and clarifes the network structure asthe cost function of the method places closely related nodestogether. But this layout algorithm would place dense partsof the graph in the same area leading to many edge crossings.Additionally, as no reduction in the number of edges or nodesis performed, the identifcation of the graph structure wouldbe very hard for large graphs with many edges.

    2. MLGA Approach

    Te experience with the existing visualization methods hasshown that it is necessary to provide a structural view ofdense networks. Representing networks with a large numberof nodes and edges in a two-dimensional area results in manyedge crossings. Dense subgraphs prevent the recognition ofthe network structure if drawn directly. Apart from othertechnical problems, this is the main shortcoming of mostlayout algorithms. We believe that the future progress invisualization of large and dense networks lies in algorithmswhich analyze the structure of the graph frst and then gen-erate a new graph which contains specifc semantic symbolsfor regular substructures like dense clusters. Additionally,the algorithms may allow for drilling down and interactivelyshow all edges for a given substructure, described below (seesection visual representation and operation). Dense clustersare ideal candidates for graph preprocessing because theycan be simply described, efciently searched, and if theyare replaced with a specifc symbol they signifcantly reducethe complexity of the resulting low-dimensional (planar orthree dimensional) picture because they contain most of theedges. Moreover, we focus on the graph clustering algorithms

    because the underlying dimension of graphs can be very highproviding difculties for other clustering algorithms.

    Graph clustering is a large feld with many algorithmsdeveloped over the years [7]; however, there is no universalsolution for all cases. Even a defnition of a cluster comes inmany favors with diferent algorithmic consequences.Tere-fore, it is important to consider a certain class of graphs whichis sufciently general in the context of bioinformatics butallows for using an efcient clustering method. Recently newclustering methods emerged based on the idea of so-called-clusters [5] or (, )-clusters [8]. Tese methods use fastheuristics which allow for clustering efciently large graphs.Te existence of such methods inspired the general ideabehind our research to use clustering algorithms to build ahierarchical structure of a given graph which can be muchbetter visualized and which tells the users more about thestructure of the underlying biological network. In the fol-lowing, we focus on -clusters but other graph clusteringmethods could be used as well.

    3. Algorithm

    Te MLGA method introduces multilevel -clustering and aspecifc tree transformation with a force-directed layout algo-rithm to visualize the structure of highly complex biologicalnetworks. First, the original graph is preprocessed using a -clustering algorithm described in [5] to identify the clusters.For every cluster, a new cluster node is created and these newnodes are linked with new edges if there are edges betweenthe underlying cluster nodes as illustrated in Figure 2.

    Tis process constructs the frst hierarchical layer abovethe original graph. Ten, the clustering algorithm is recur-sively applied to the cluster nodes itself to generate a clusterhierarchy. Aferwards, this hierarchy is transformed to a treeshowing only the shortest paths from a root node through theintermediate cluster nodes to the nodes of the initial graph.Finally, a modifed version of the force-directed algorithmvisualizes the tree structure of the remaining graph. Tiscombination of preprocessing and layout algorithm eases theidentifcation of the cluster structure and their interactions,see Figure 1(b).

    For the clustering step, we prefer -clustering to (, )-clustering or to other more complex methods because itwould bemuchmore difcult to control the clustering param-eters during the transitions between the hierarchy levels. Teonly parameter which has to be specifed for our algorithm atevery hierarchical level is the parameter . It can be seen thatthe density of the graph grows when the algorithm proceedsto the higher levels. On the other hand, the number of nodesdecreases very rapidly so that afer few steps there is onlyone clique lef. As a consequence, it is not meaningful to usethe same clustering parameters as the algorithm recursivelyproceeds up the hierarchy. For more complex clustering algo-rithms, it would be very difcult to defne a good clusteringparameters if the parameter space has more dimensions.In our case, the sequence of the values for the parameter must be growing. As we discuss later, the actual valuescan be empirically determined and moreover 3-4 values aresufcient for large graphs.

    13

  • 14 Advances in Bioinformatics

    (a) (b)

    Figure 2: (a) Te construction of a cluster hierarchy and (b) the transformation to a tree.

    4. Algorithmic Phases

    Let = (, ) be an undirected graph with the vertex set and edge set . A -cluster for 0 1, also described as-clique or dense subgraph, is a subset such that for itsedge set () and the vertex set () the following is true:

    | ()| (| ()|2 ) . (1)Finding a -clique of maximal cardinality in is the

    maximal -clique problem. Te 1-clique problem is NP hardand is proved to be hard to approximate [9].

    To identify the clusters on one hierarchical level, we usea heuristic developed in [5] to detect -clusters for verylarge graphs. Reference [5] introduced a potential functionon a vertex set relative to a given -cluster and derived analgorithm to discover maximal -clusters.Te time complex-ity of the algorithm is (||||2) with the set of vertexesof the maximal -cluster detected. Further, the authors usea greedy randomized adaptive search procedure (GRASP)version of the algorithm with edge pruning. Te feasibilityof the resulting method was demonstrated by applying it totelecommunication data with millions of vertexes and edges.

    5. -Cluster DetectionTo fnd all -clusters on one level in the graph a variant(Algorithm 1) of the GRASP approach of [5] is used. Tecluster construction procedure is the nonbi-partite case for fnding a high cardinality cluster of specifeddensity in a graph with nodes and edges. Our algorithmrepeatedly applies the detection algorithm to the highesthierarchical level of the new graph. It terminates if no more-clusters are found or the number of clusters with a cluster-size below a given minimum size is reached.

    6. Hierarchy Creation

    Te cluster detection algorithm is repeatedly applied to thegraph and the clusters to build a hierarchy; see Figure 2(a).Each node of the graph has an attribute level which is

    input: : Verticesinput: : Edgesinput: : density of clusterbegin

    initialize empty list of clusters C;count 0;cluster construct dsubg(,V, E);while cluster = 0 count

  • beginlevel 0;nodes getNodes(level); getGammaValue(level);clusters createClusters(nodes, );while clusters = 0 do

    create one node on the next level for each cluster;create edges between clusters and nodes;level level + 1;nodes getNodes(level); getGammaValue(level);clusters createClusters(nodes, );

    endend

    Algorithm 2: createMultiLevelClusters.

    7. Tree Transformation

    To gain the structure of the cluster hierarchy, a tree transfor-mation is performed; see Figure 2(b). In the transformation(Algorithm 3), a hidden root node is connected to all clusternodes at the highest level as their parent. Aferwards, onlythe edges belonging to the shortest path from the root nodeto each node is shown. If the shortest path is not unique apath will be chosen at random. Te distance for each node iscalculated beginning from the root using a breath-frst search.Te parent of a node will be set to the neighbor node with theshortest distance. If the node belongs to a cluster node at onelevel above, the parent is set to this cluster.

    8. Layout Algorithm

    A modifed version of a force-directed algorithm [2] is usedto lay out the transformed graph. Our method introducesdiferent edge length on each level. Longer edges are assignedto higher levels than on lower levels. Tis results in anatural visualization of the hierarchy. Furthermore, the initialpositions of the nodes are specifcally calculated.Te nodes ofthe graph are located on concentric circles with the hiddenroot node at the center. Nodes immediately connected tothe root are positioned at the next inner circle and so on. Asegment of the circle is assigned to each node within which itslocation is calculated. Recursively, a fraction of this segmentis assigned to the children of the node on the next circle.Tis initial setup reduces the rendering time and guides thelayout algorithm to visualize the tree structure. A randominitial positioningmay result in a localminimumof the force-directed layout with many edge crossings which would dis-rupt the tree representation. Additionally, the repulsive forcesare ignored beyond a given distance depending on the sizeof the drawing area. Tis restriction prevents disconnectedcomponents of the graph from separating too far. To suppressthe well-known oscillation problem [10] of force-directedalgorithms a dumping heuristics is usedwherewe compute anaverage of the previous node positions during the forcecalculation.

    begincreate root;set parent of highest level nodes to root;candidates highest level nodes;foreach node candidates do

    if node belongs to a cluster one level above thennode.parent cluster;

    elseset node.parent to the neighbor with shortestdistance to the root;

    endnode.dist node.parent.dist + 1;foreach neighbor of node except node.parent do

    if neighbor has already been visited thenhide edge;

    elsecandidates candidates neighbor;

    endend

    endend

    Algorithm 3: treeTransformation.

    As the graph is transformed to a tree, other layoutalgorithms can be used in the this phase. Reference [11] usesa level-based approach which horizontally aligns nodes withthe same distance from the root node. As only a few levels arecreated for our initial graph the resulting drawing would havea much larger width than height. A ringed circular approachlike [12] where the children of the nodes are plotted on theperiphery of a circle has a better space efciency on the 2Dplane than [11]. But a visual inspection of the resulting graphin [12, page 11, Figure 7 (lef)] shows that the force-directedlayout distributes the children of a node more evenly.

    9. Visual Representation and Operation

    Afer the creation of the cluster hierarchy and the treetransformation many initial edge connections are hidden.In the presentation of the resulting graph the nodes of theinital graph are colored depending on the number of edgesin the initial graph. Additionally cluster nodes and edges arevisualized with diferent symbols and colors (Figure 3).

    Our implementation of the visualization tool ofers twooperations to get deeper insight into the original graph. First,all edges between a selected node and its direct neighborscan be highlighted (Figure 4(b)). If the marked node is acluster node, all connections to the nodes of the clusterwill be shown. During the tree transformation, most ofthese connections were eliminated and a direct connectionbetween two nodes in diferent clusters was replaced by anindirect connection between the cluster nodes. Te secondoperation will display all edges between the nodes forminga -cluster node (Figure 4(c)) which allows the user totemporarily alter the view between the star-shaped clusternode and the real connections of the cluster.

    Advances in Bioinformatics 15

  • 16 Advances in Bioinformatics

    Node of initial graph (degree 0)

    Node of initial graph (degree 1)

    Node of initial graph (degree 2)

    Node of initial graph (degree 3)

    Node of initial graph (degree 4)

    Node of initial graph (degree 5)

    Edge of initial graph

    Edge of a node belonging to a cluster

    Edge between cluster nodes of same level

    Cluster node level 1

    Cluster node level 2

    Node of initial graph (degree 6 or higher)

    Figure 3: Semantic symbols used in MLGA visualization. Nodesand edges are color encoded, nodes of the initial graph are coloredaccording to their degree and cluster nodes are enlarged at each levelof the hierarchy.

    10. Computation Speed and

    Memory Requirements

    Retrieving a -clique with the clustering algorithm of [5]has a running time of (||||2) with || the size of thedetected clique and || the size of the initial graph. As thealgorithm is recursively applied to the remaining nodes of

    the graph, a time complexity of at most (||3) results oneach level of the hierarchy. Te number of levels depends onthe number of clusters found on each level. It ranges fromthe worst case where two nodes are clustered together log2||down to 1 if all nodes are in the same cluster. Terefore, thetotal runtime order has an upper limit ((log2||)4) and alower limit (||3) . Te tree transformation of the resultinghierarchy uses breath-frst search. As newnodes and edges areintroduced during the hierarchy creation its runtime rangesfrom (4|| + ||) down to(|| + ||) in terms of the initalnumber of nodes || and initial number of edges ||.

    Te implemented version of force-directed layout algo-

    rithm needs a runtime of (||2). A specialized tree layoutalgorithm like [11] has a runtime of order (||) and [12] anorder (||) or (|| log ||) if an optimal solution for thecircle size is required.

    Te memory required by the algorithm mainly dependson the graph representation. For biological networks a repre-sentation between theworst case(4||+||) and(||+||)space is suitable.

    11. Results

    11.1. Metabolic Networks. To provide experimental justifca-tion of the proposed method, we extracted the metabolicnetwork for Arabidopsis thaliana from Genevestigator [13].Te network has 932 nodes and 2315 edges. Te edges

    represent metabolites and the nodes represent biochemicalreactions. We used two versions of the network as illustratedin Figure 5(a) where the second version in Figure 5(b) con-tains additionally the regulatory pathways with enzymes asedges leading to 1199 nodes and 4386 edges.

    Te application of multi-level -clustering to visualize theA. thaliana biochemical network (Figure 5(a)) revealed thatboth global view and lucidity were sustained, which is alsotrue when regulatory pathways were added to the network(Figure 5(b)). We looked at plant isoprenoid biosynthesis, inparticular at the synthesis of brassinosteroids (BRs), a class ofplant hormones which are essential for the regulation of plantgrowth and development [14, 15]. All individual reactionsteps, leading to BRs, were found structured according tometabolite fux through the pathway as part of a level 2cluster, also containing upstream pathways as well as reac-tions leading to other isoprenoid end products (Figure 5(a)).Afer inclusion of regulatory elements and signal-transduc-tion chains, multi-level -clustering assigned the biochem-ical reactions from brassinosteroid biosynthesis to clusterscontaining reactions known to be regulated by BRs andelements that are involved in the regulation of this biosyn-thetic pathway. As an example, the known fact that BRs actsynergistically with auxins to promote cell elongation [16] isnicely refected in the MLGA drawn network (Figure 5(b)).

    11.2. Gene Interaction Networks. Te MLGA method canalso be successfully used to analyze gene interaction net-works. Gene interaction networks are constructed as graphswhere nodes represent genes and edges represent interactionsbetween genes on various biological levels, as for example,interactions between the corresponding proteins or regula-tory and causal interactions obtained from gene expressionexperiments.Tere is a long-term research intomethods howto obtain networks which identify diferent kinds of geneinteraction networks based on diferent types of input data[1722]. However, as we illustrate later, even if a simplifednetwork generationmethod was used, our visualization algo-rithm was able to identify correctly the biologically meaning-ful subsystems from a genomewide correlation network.

    To generate the gene correlation networks, we used aMus musculus dataset from the Genevestigator database. Tedata consisted of 3157 publicly available Afymetrix arrays.Each array measured the expression values of 12488 genes.Te gene correlation matrix was calculated using the Pearsoncorrelation and aferwards a network was generated, wherean edge was introduced between two genes if the correlationvalue between the genes was above a certain threshold.Two networks were constructed using a threshold of 0.72 inFigure 6(a) and 0.80 in Figure 6(b).

    We use a well-known ribosomal gene complex [2325]to illustrate the possibilities of MLGA to discover interestingstructures in the correlation network. An inspection of thecluster highlighted in Figure 6(a) shows that it containsgenes which are documented to belong to the ribosomalcluster. Moreover, our method has a structural stability ina wide range of graph density. Tis can be seen comparingFigures 6(a) and 6(b). Figure 6(a) has lower threshold; there-fore, it contains 38 889 edges and 2774 nodes compared to

  • (a) (b) (c)

    Figure 4: (a) A part of a gene correlation network ofArabidopsis thaliana drawnwithMLGA, (b) showing all edges connected to the -clusternode at the top right and (c) displaying all edges between the nodes defning the cluster. Te inset shows a magnifcation of the edges of theselected cluster.

    (a) (b)

    Figure 5: MLGA applied to (a) A. thaliana biochemical network without signaling efects and regulatory elements. Reactions directlyinvolved in the synthesis of brassinosteroids are highlighted with the red color and direct connections are depicted by red edges. Te level2 cluster, indicated by an arrow, combines the major parts of isoprenoid biosynthesis, resulting from the nonmevalonate pathway. (b) A.thaliana biochemical network including signaling efects and regulatory elements. Reactions directly involved in brassinosteroid and auxinmetabolism/signaling are highlighted with red and direct connections are depicted by red edges. Black arrows point to reactions involved inbrassinosteroid metabolism/signaling. Green arrows point to reactions involved in auxin metabolism/signaling.

    8659 edges and 1232 nodes in Figure 6(b). In both cases theribosomal cluster can be clearly identifed.

    12. Discussion

    Visualization methods ofen contain parameters which mustbe empirically identifed. In [4], the selection of pivot nodes is

    determined by a set of empirical values to achieve a uniformdistribution of these nodes in the network. Aferwards,the layout is computed with respect to the selected pivotnodes. In our method, the only important empirically setparameters are the cluster densities on each hierarchicallevel. Tis infuences the granularity of the visualization andthe subsequent tree transformation supports the recovering

    Advances in Bioinformatics 17

  • 18 Advances in Bioinformatics

    (a) (b)

    Figure 6: MLGA applied to (a)Mus musculus gene correlation network generated with a threshold of 0.72. (b)Te gene correlation networkgenerated with a threshold of 0.80. Te red highlighted nodes and direct connections belong to the ribosomal cluster.

    of the network structure. Te computational experience hasshown that it is recommended to use a slightly smaller value in the frst than in the subsequent levels; we use 0.5 and0.7, respectively. Te density of the graph grows considerablyover the hierarchical levels. In most of the experiments witha graph sizing up to 105 edges the third level was already aclique. Terefore, very few values are needed even for largegraphs.

    A diferent approach in a similar direction as our researchis described in [26]. Te authors consider weighted graphswhere vertexes with degree 1 are repeatedly removed untilno more vertex of degree 1 exists. Te removed nodes can beadded at the end of the drawing algorithm. Afer that, a hier-archy of clusters is calculated on the remaining graph usingan approximation of the graph distances. A cluster is formedif the pairwise shortest path of its nodes is equal to or above athreshold which depends on the hierarchy level. Tis leads todiferent defnition of clusters than in our algorithm togetherwith a diferent cluster heuristics. Te algorithm preservesmore edges than in our casemaking the structure of the graphmore difcult to identify. Comparing the result of MLGAFigure 1(b) with [26], the structure of graphs in [26] is onlypartially resolved (see, e.g., [26, page 2, Figure 1]). Similarlyas in our case a version of force-directed algorithm is usedin the last stage to refne the visualization. A higher weight isassigned to the edges between the nodes of a cluster than tothe edges between clusters.Tis forces the nodes of a cluster tobe drawn close to each other. In our approach, the additionalcluster node and the desired edge length have a similar efect.

    In addition to complexity problems for large graphs (NP-hard approximation), the algorithms based on identifcationof cliques have also a drawback that it is ofen not clear whichcliques are relevant. Tis problem is particularly presentin cases of graphs with dense subgraphs, where we obtain

    a system of large cliques with similar sizes which haveadditionally large intersections. In particular, in the presenceof noise where every measurement defnes a graph whichdifers in a small percentage of edges, it is difcult to decideduring visualization based on cliques which part of the graphcan be emphasized as structurally important. An interestingdevelopment is represented by [27] where the authors con-centrate on intersections of large clusters under a conditionthat these intersections are cliques. Tey identify so-calledatom subgraphs which represent clique minimal separatordecomposition (a separator is a set of vertexes whose removalwill disconnect the graph into several parts). However, therelaxation from cliques to dense clusters in our methodimproves also on the intersection problem because ourmethod would glue the cliques with large intersection intoone cluster.

    13. Conclusion

    As discussed, many approaches try to improve the layout ofcomplex networks through better placement of the nodesalone. In our work, we pursue a diferent line of researchtowards efcient visualization algorithms for large biologicalnetworks. Our approach does not aim at rendering all edgesin a network, but we focus on the discovery and visualizationof important structural features. Tis approach is combinedwith complementary visual operations which allow to drill-down into the details of structurally identifed elements. TeMLGA method is successful in identifying the biologicallyrelevant structures and allows for processing very largegraphs as we illustrated on two diferent case studies of bio-logical networks. Naturally, this paradigm opens new ques-tions on how to further improve the visualization outputand speed. Diferent clustering algorithms can be tried to

  • create the multi-level structure; however, in the case ofmultiparameter clustering the control and analysis of theparameter values between the levels would become moredifcult.

    On the theoretical side, the next question to consideris how to provide a provably good (optimal) sequence of values. Another question is whether the surprisingly goodstructure identifcation features of our algorithm could betraced back to the scale-free character of many biologicalnetworks.

    Disclosure

    All materials, source code, and small case studies data canbe freely downloaded from http://www.pw.ethz.ch/research/projects/complexnetworks/. Very large datasets are availableon request.

    Acknowledgments

    Te authors would like to thank Professor Peter Widmayerfor the ongoing support of this project. Tis work was alsosupported by Commission for Technology and Innovation ofthe Swiss Federation under Grant 9428.1 PFLS-LS.

    References

    [1] T. Kamada and S. Kawai, An algorithm for drawing generalundirected graphs, Information Processing Letters, vol. 31, no. 1,pp. 715, 1989.

    [2] T. M. J. Ffuchterman and E. M. Reingold, Graph drawing byforce-directed place-ment, Sofware, vol. 21, no. 11, pp. 11291164, 1991.

    [3] A. T. Adai, S. V. Date, S. Wieland, and E. M. Marcotte, LGL:creating a map of protein function with an algorithm for visu-alizing very large biological networks, Journal of MolecularBiology, vol. 340, no. 1, pp. 179190, 2004.

    [4] K. Han and B.-H. Ju, A fast layout algorithm for protein inter-action networks, Bioinformatics, vol. 19, no. 15, pp. 18821888,2003.

    [5] J. Abello, M. Resende, and S. Sudarsky, Massive quasi-cliquedetection, in LATIN 2002:Teoretical Informatics, S. Rajsbaum,Ed., vol. 2286 of Lecture Notes in Computer Science, pp. 598612,Springer, Berlin, Germany, 2002.

    [6] W. Li and H. Kurata, A grid layout algorithm for automaticdrawing of biochemical networks, Bioinformatics, vol. 21, no.9, pp. 20362042, 2005.

    [7] S. E. Schaefer, Graph clustering, Computer Science Review,vol. 1, pp. 2764, 2007.

    [8] N. Mishra, R. Schreiber, I. Stanton, and R. Tarjan, Clusteringsocial networks, in Algorithms and Models For the Web-Graph,A. Bonato and F. Chung, Eds., vol. 4863 of Lecture Notes inComputer Science, pp. 5667, Springer, Berlin, Germany, 2007.

    [9] J. Hastad, Clique is hard to approximate within n1, ActaMathematica, vol. 182, pp. 105142, 1999.

    [10] A. Frick, A. Ludwig, and H. Mehldau, A fast adaptive layoutalgorithm for undirected graphs (extended abstract and systemdemonstration), in Graph Drawing, R. Tamassia and I. Tollis,

    Eds., vol. 894 of Lecture Notes in Computer Science, pp. 388403,Springer, Berlin, Germany, 1995.

    [11] E. M. Reingold and J. S. Tilford, Tidier drawings of trees, IEEETransactions on Sofware Engineering, vol. SE-7, no. 2, pp. 223228, 1981.

    [12] S. Grivet, D. Auber, J. P. Domenger, and G. Melancon, Bubbletree drawing algorithm, in Computer Vision and Graphics, K.Wojciechowski, B. Smolka, H. Palus, R. Kozera,W. Skarbek, andL. Noakes, Eds., vol. 32 of Computational Imaging and Vision,pp. 633641, Springer, Dordrecht, Te Netherlands, 2006.

    [13] T.Hruz, O. Laule, G. Szabo et al., Genevestigator v3: a referenceexpression database for the meta-analysis of transcriptomes,Advances in Bioinformatics, vol. 2008, Article ID 420747, 5pages, 2008.

    [14] T.Asami, Y. K.Min, K. Sekimata et al., Mode of action of brassi-nazole: a specifc inhibitor of brassinosteroid biosynthesis,ACSSymposium Series, vol. 774, pp. 269280, 2001.

    [15] J.-X. He, J. M. Gendron, Y. Yang, J. Li, and Z.-Y. Wang, TeGSK3-like kinase BIN2 phosphorylates and destabilizes BZR1,a positive regulator of the brassinosteroid signaling pathway inArabidopsis, Proceedings of the National Academy of Sciencesof the United States of America, vol. 99, no. 15, pp. 1018510190,2002.

    [16] K. J. Halliday, Plant hormones: the interplay of brassinosteroidsand auxin, Current Biology, vol. 14, no. 23, pp. R1008R1010,2004.

    [17] K. Y. Yip, R. P. Alexander, K.-K. Yan, and M. Gerstein,Improved reconstruction of in silico gene regulatory networksby integrating knockout and perturbation data, PLoS ONE, vol.5, no. 1, Article ID e8121, 2010.

    [18] M. Mutwil, B. Usadel, M. Schutte, A. Loraine, O. Ebenhoh,and S. Persson, Assembly of an interactive correlation networkfor the Arabidopsis genome using a novel Heuristic ClusteringAlgorithm, Plant Physiology, vol. 152, no. 1, pp. 2943, 2010.

    [19] D. Marbach, R. J. Prill, T. Schafer, C. Mattiussi, D. Flore-ano, and G. Stolovitzky, Revealing strengths and weaknessesof methods for gene network inference, Proceedings of theNational Academy of Sciences of the United States of America,vol. 107, no. 14, pp. 62866291, 2010.

    [20] S. De Bodt, S. Proost, K. Vandepoele, P. Rouze, and Y. Van dePeer, Predicting protein-protein interactions in Arabidopsisthaliana through integration of orthology, gene ontology andco-expression, BMC genomics, vol. 10, p. 288, 2009.

    [21] D. R. Rhodes, S. A. Tomlins, S. Varambally et al., Probabilisticmodel of the human protein-protein interaction network,Nature Biotechnology, vol. 23, no. 8, pp. 951959, 2005.

    [22] A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes, Discov-ery of meaningful associations in genomic data using partialcorrelation coefcients, Bioinformatics, vol. 20, no. 18, pp.35653574, 2004.

    [23] J.-F. Rual, K. Venkatesan, T. Hao et al., Towards a proteome-scale map of the human protein-protein interaction network,Nature, vol. 437, no. 7062, pp. 11731178, 2005.

    [24] K. Ishii, T. Washio, T. Uechi, M. Yoshihama, N. Kenmochi, andM. Tomita, Characteristics and clustering of human ribosomalprotein genes, BMC Genomics, vol. 7, article 37, 2006.

    [25] O. Atias, B. Chor, and D. A. Chamovitz, Large-scale analysisof Arabidopsis transcription reveals a basal co-regulation net-work, BMC Systems Biology, vol. 3