9
Review Functional annotation and biological interpretation of proteomics data Carolina M. Carnielli, Flavia V. Winck, Adriana F. Paes Leme Laboratório de Espectrometria de Massas, Laboratório Nacional de Biociências, LNBio, CNPEM, Campinas, Brazil abstract article info Article history: Received 11 September 2014 Received in revised form 7 October 2014 Accepted 21 October 2014 Available online 31 October 2014 Keywords: Bioinformatics Ontology Annotation Network Systems biology Proteomics Proteomics experiments often generate a vast amount of data. However, the simple identication and quantication of proteins from a cell proteome or subproteome is not sufcient for the full understanding of complex mechanisms occurring in the biological systems. Therefore, the functional annotation analysis of protein datasets using bioinformatics tools is essential for interpreting the results of high-throughput proteomics. Although large-scale proteomics data have rapidly increased, the biological interpretation of these results remains as a challenging task. Here we reviewed basic concepts and different programs that are commonly used in proteomics data functional annotation, emphasizing the main strategies focused in the use of gene ontology annotations. Furthermore, we explored the characteristics of some tools developed for functional annotation analysis, concerning the ease of use and typical caveats on ontology annotations. The utility and variations between different tools were assessed through the comparison of the resulting outputs generated for an example of proteomics dataset. © 2014 Elsevier B.V. All rights reserved. 1. Introduction Proteomics encompasses a broad range of high-throughput technol- ogies that allows the identication and the quantication of proteins in complex biological samples. Quantitative proteomics approaches rely on the ability to detect small changes in protein abundance of an altered state given a control or reference condition. Thus, the quantication of differences between two or more physiological states of a biological system can be expressed as an absolute protein quantication, by the determination of the exact protein amount or concentration, or as a rel- ative quantication of protein amount, in which the amount of a protein can be dened as fold changes relative to the control sample, determin- ing the up- or down-regulation of such protein [1,2]. Proteomics ap- proaches have been extensively applied in biomedical research for the understanding of diseases, including protein-based biomarker discov- ery for the early detection and monitoring of different types of cancer [3,4], the analysis of abnormal protein phosphorylation patterns associ- ated with diseases [5,6], such as Alzheimer's [7], the identication of therapeutic targets [8,9], among others. However, mass spectrometry- based proteomics often generates large lists of identied proteins whose interpretation is a challenging task in the eld. In order to handle the proteomics data, Biostatistics and Bioinformatics tools become indispensable to the interpretation of biological data and to extract the biological relevance from the vast amount of identied proteins [10]. Thus, protein functional annotation through computational tools now occupies a place as important as the protein identication itself. Since the advent of shotgun proteomics, many Bioinformatics tools have been developed to provide methodologies for functional annotation of proteomics data. Typical approaches for data interpretation for organisms without an annotated genome include mainly the automated protein annotation as a rst step in the data analysis workow. Protein domains, protein family, subcellular localization and biological function are predicted based on sequence similarity searches [1114]. Once the protein sequences are functionally annotated, several other tools must be applied to the search for functional patterns and overrep- resentation of biological functions or processes in a protein dataset from qualitative or quantitative proteomics data. Further steps in the analysis usually include pathway analysis and the prediction of interaction net- works, which are generated through integration of different biological layers of information, such as gene expression and co-expression patterns, proteinprotein interactions and protein expression data. Moreover, visualization tools largely contribute to localize the presence of targeted proteins within cellular biological pathways, signaling cascades and metabolic pathways being the most represented ones in proteomic studies. A variety of commercial and open-source bioinformatics tools for the analysis of proteomics data and statistical tests have been developed. However, with the increased amount of proteomics data new challenges in data handling, analysis and visualization push forward the development of the eld of computational proteomics. In order to give an overview of tools and approaches currently applied in proteomics functional annotation, we reviewed and discussed different approaches, Biochimica et Biophysica Acta 1854 (2015) 4654 Corresponding author at: Laboratório Nacional de Biociências (LNBio), Centro Nacional de Pesquisa em Energia e Materiais (CNPEM), 13083970 Campinas, Brazil. Tel.: +55 19 3512 1118; fax: +55 19 3512 1006. E-mail addresses: [email protected] (C.M. Carnielli), [email protected] (F.V. Winck), [email protected] (A.F. Paes Leme). http://dx.doi.org/10.1016/j.bbapap.2014.10.019 1570-9639/© 2014 Elsevier B.V. All rights reserved. Contents lists available at ScienceDirect Biochimica et Biophysica Acta journal homepage: www.elsevier.com/locate/bbapap

Functional annotation and biological interpretation of proteomics data

Embed Size (px)

Citation preview

Page 1: Functional annotation and biological interpretation of proteomics data

Biochimica et Biophysica Acta 1854 (2015) 46–54

Contents lists available at ScienceDirect

Biochimica et Biophysica Acta

j ourna l homepage: www.e lsev ie r .com/ locate /bbapap

Review

Functional annotation and biological interpretation of proteomics data

Carolina M. Carnielli, Flavia V. Winck, Adriana F. Paes Leme ⁎Laboratório de Espectrometria de Massas, Laboratório Nacional de Biociências, LNBio, CNPEM, Campinas, Brazil

⁎ Corresponding author at: Laboratório Nacional dNacional de Pesquisa em Energia e Materiais (CNPEM)Tel.: +55 19 3512 1118; fax: +55 19 3512 1006.

E-mail addresses: [email protected] ([email protected] (F.V.Winck), adriana.paesleme@

http://dx.doi.org/10.1016/j.bbapap.2014.10.0191570-9639/© 2014 Elsevier B.V. All rights reserved.

a b s t r a c t

a r t i c l e i n f o

Article history:Received 11 September 2014Received in revised form 7 October 2014Accepted 21 October 2014Available online 31 October 2014

Keywords:BioinformaticsOntologyAnnotationNetworkSystems biologyProteomics

Proteomics experiments often generate a vast amount of data. However, the simple identification andquantification of proteins from a cell proteome or subproteome is not sufficient for the full understanding ofcomplexmechanisms occurring in the biological systems. Therefore, the functional annotation analysis of proteindatasets using bioinformatics tools is essential for interpreting the results of high-throughput proteomics.Although large-scale proteomics data have rapidly increased, the biological interpretation of these resultsremains as a challenging task. Here we reviewed basic concepts and different programs that are commonlyused in proteomics data functional annotation, emphasizing the main strategies focused in the use of geneontology annotations. Furthermore, we explored the characteristics of some tools developed for functionalannotation analysis, concerning the ease of use and typical caveats on ontology annotations. The utility andvariations between different tools were assessed through the comparison of the resulting outputs generatedfor an example of proteomics dataset.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Proteomics encompasses a broad range of high-throughput technol-ogies that allows the identification and the quantification of proteins incomplex biological samples. Quantitative proteomics approaches relyon the ability to detect small changes in protein abundance of an alteredstate given a control or reference condition. Thus, the quantificationof differences between two or more physiological states of a biologicalsystem can be expressed as an absolute protein quantification, by thedetermination of the exact protein amount or concentration, or as a rel-ative quantification of protein amount, inwhich the amount of a proteincan be defined as fold changes relative to the control sample, determin-ing the up- or down-regulation of such protein [1,2]. Proteomics ap-proaches have been extensively applied in biomedical research for theunderstanding of diseases, including protein-based biomarker discov-ery for the early detection and monitoring of different types of cancer[3,4], the analysis of abnormal protein phosphorylation patterns associ-ated with diseases [5,6], such as Alzheimer's [7], the identification oftherapeutic targets [8,9], among others. However, mass spectrometry-based proteomics often generates large lists of identified proteinswhose interpretation is a challenging task in the field. In order to handlethe proteomics data, Biostatistics and Bioinformatics tools becomeindispensable to the interpretation of biological data and to extract the

e Biociências (LNBio), Centro, 13083970 Campinas, Brazil.

. Carnielli),lnbio.cnpem.br (A.F. Paes Leme).

biological relevance from the vast amount of identified proteins [10].Thus, protein functional annotation through computational tools nowoccupies a place as important as the protein identification itself. Sincethe advent of shotgun proteomics, many Bioinformatics tools havebeen developed to provide methodologies for functional annotationof proteomics data. Typical approaches for data interpretation fororganismswithout an annotated genome includemainly the automatedprotein annotation as a first step in the data analysis workflow. Proteindomains, protein family, subcellular localization and biological functionare predicted based on sequence similarity searches [11–14].

Once the protein sequences are functionally annotated, several othertools must be applied to the search for functional patterns and overrep-resentation of biological functions or processes in a protein dataset fromqualitative or quantitative proteomics data. Further steps in the analysisusually include pathway analysis and the prediction of interaction net-works, which are generated through integration of different biologicallayers of information, such as gene expression and co-expressionpatterns, protein–protein interactions and protein expression data.Moreover, visualization tools largely contribute to localize the presenceof targeted proteins within cellular biological pathways, signalingcascades and metabolic pathways being the most represented ones inproteomic studies.

A variety of commercial and open-source bioinformatics tools for theanalysis of proteomics data and statistical tests have been developed.However, with the increased amount of proteomics data newchallenges in data handling, analysis and visualization push forwardthe development of the field of computational proteomics. In order togive an overviewof tools and approaches currently applied in proteomicsfunctional annotation, we reviewed and discussed different approaches,

Page 2: Functional annotation and biological interpretation of proteomics data

Fig. 1. Interconnection of relationships of the hierarchical distribution of GO terms.An example of the hierarchical organization of GO annotations is shown for theGO term “actin filament bundle organization” together with the relationships andintermediate GO terms between the most ancestor (Biological process) to the mostspecific child GO term (actin filament bundle organization) (Figure adapted fromQuickGO — http://www.ebi.ac.uk/QuickGO).

47C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

computational programs and strategies recently applied for datainterpretation, and how different aspects of the analysis can modifythe outcome of proteomics studies.

2. Biological meaning of large proteomics datasets through geneontology-based annotation approaches

The prediction of the functional role of identified proteins in abiological event involves a first step of gathering information, a taskthat must be performed before the actual biological data interpretationis achieved and may include genome and proteome annotations. Manytools have been developed to mine several databases of biologicalinformation to finally predict a protein function based on sequencesimilarities. Detailed strategies on genomics and proteomics sequenceannotation can be found in previous publications [11–17].

Nevertheless, once the genome and proteome are annotated,one of the most disseminated strategies of proteomics data functionalannotation includes the use of ontologies, which can be understood asan explicit specification of a conceptualization [18]. Usually, ontologiesare designed with hierarchical classes, communicating definitionswith clarity and objectivity, however, keeping extendibility. In Biology,the ontologies for genes and proteins usually describe the classificationof the molecules according to their role in the biological systems, usingcontrolled vocabulary, which permits the analysis of relationshipsbetween the ontology terms through data integration, retrieval andfunctional annotation of large datasets [19–22].

In this scenario, Ashburner et al. developed a controlled vocabularyapplicable to all eukaryotes, generating the Gene Ontology (GO)Consortium [23], with the aim to overcome the lack of interoperabilityof genomic databases resultant of the divergent nomenclature of genesand proteins. Every gene or protein can thus be described by a finitenumber of vocabulary terms, which are classified into one of the threeGO-categories or domains: biological process, molecular function orcellular component [23].

It is noteworthy that GO annotations to a term are included in ahierarchy of terms, having a more general annotation at the highestlevels of the hierarchy and more specific annotation at lower levelsof the hierarchy. Moreover, a GO term of lower hierarchical level(Child term) can have a relation to one or more terms of higher hierar-chical levels (Parent term), which can be traced up to one ormore of theGO root terms which correspond to the three GO domains (biologicalprocess, cellular component or molecular Function). For instance, if agene is found to be related to ‘actin filament bundle organization’according to its GO annotation, it will be annotated downwardswithin the hierarchy of its parent terms, which include ‘actinfilament organization’, ‘cytoskeleton organization’ and ‘organelleorganization’ (Fig. 1).

Thus, more information can be retrieved from parent terms, whichincreases the knowledge when making inferences about gene function.On the other hand, researchesmust consider that GO annotations can beredundant, i.e., a term can be associated to one gene or gene product bymore than one annotation. In a recent study, Gillis and Pavlidis [24]found that GO annotations are stable over short periods of time, withlosses of semantic similarity for 3% of the genes annotated betweenmonthly GO editions. Thus, some undergo changes in their ‘functionalidentity’ over time as a result of annotation updates, resulting in lossof semantic similarity matching. Additionally, they presented a way toquantify the stability of GO annotations over time and showed that, inamoderate time,many genes undergo changes in their annotated func-tionality. Thus, modifications on gene ontologies may influence the re-sults on functional annotation of experimental data [25]. Despite thatchanges in GO annotations are non-uniformly distributed over differentbranches of the ontology, the results of term-enrichment analyses werefound stable [25].

In order to observe and to demonstrate howdifferent versions of theGO annotations may affect the final interpretation of a proteomics

dataset we performed a comparison of the results of the enrichmentanalysis of GO terms using the application BiNGO v.3.0.2 [26] app inthe Cytoscape v.3.0.1 [27] to an example proteomics dataset previouslypublished [8]. We used equal parameters for the data analysis andchanges in the significant overrepresented terms were evaluatedby comparing the list of the Top 10 most significant overrepresentedGO terms.

It was observed that changes in the list of the Top 10most significantGO terms retrieved usingGO annotationfiles from2011, 2012, 2013 and2014 occurred with the different annotation files (SupplementalTable 1). However, 60% of the GO terms consistently appeared in theTop10 list of the most significant GO terms, implying a data driftamong versions of GO annotation. Nevertheless, most of the GO annota-tions remain partially stable over time. Thus, it is imperative to performfunctional annotation analysis with the most recent version of the GOannotation and ontology files. Moreover, it is important that novel ap-proaches on functional annotation are integrated into dynamic dataanalysis, allowing on-time updating of annotation files to facilitate andimprove the interpretation of published proteomics datasets. Detaileddescription of parameters and GO association and gene ontology filesused in this comparative analysis are available in the SupplementalTable 1.

Furthermore, knowing what has been modified between differentversions of the ontology can be very useful. The web service CODEX(Complex Ontology Diff Explorer) was developed to allow users to ver-ify which changes were performed in a precise version of the ontology[28]. However, it is crucial to report, in an ontology-based study,which version of the gene annotation was used in order to track alter-ations on functional annotation due to time dependence of GO results.

GO annotations can be applied to perform a functional profiling ofprocesseswhichmight be different in a particular set of genes, to predictgene function or to categorize genes in ontology terms [29]. Therefore,the identification of overrepresented categories or enrichment analysiscan be performed based on ontologies, contributing to functionally

Page 3: Functional annotation and biological interpretation of proteomics data

48 C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

characterize protein sets. Bioinformatics tools for such analysis mapthe corresponding input data containing a protein list or gene list(experimental data) to the associated biological GO annotation,identifying statically overrepresented annotations [30]. The overrep-resented GO terms are ranked by the p-value score and can bebiologically interpreted or considered by the researcher.

The GO database is constructed with the most recent version of theontology andnewannotationfiles are contributed and sent bymembersof the GO Consortium. Thus, theGO undergoes frequent revisions to addnew relationships and terms or to remove obsolete ones. However,not all proteins have been completely annotated to a GO term. Thus,it is crucial to use recent available ontology versions when performingfunctional annotation analysis.

Each new annotation in the GO is linked to its source and a databaseentry is attributed to it. The source can be a literature reference, a data-base reference or computational analysis [31]. However, the most im-portant attribute of an annotation is the evidence code, which recordsthe supportive information on the gene annotation to a particularterm. GO annotations include evidence codes from four categories:inferred from experimental, computational, indirectly derived fromboth, or unknown [23]. The vast majority of available GO annotationsare assigned using computational methods (more than 95%, see [29]);however, several studies disregard annotations without manualcuration, i.e., those that are inferred from electronic annotation (IEA).These IEA annotations are generally thought to be of lower qualitythan those manually annotated and that should be used with caution,once they probably contain a higher portion of false positives [29]. Onthe other hand, IEA annotations are useful to provide a first insight ofwhich GO terms are associated to the analyzed dataset.

Considering that most annotations are inferred electronically,a methodology to systematically and quantitatively evaluate IEA anno-tations was previously developed [32]. Based on the releases of theUniProt Gene Ontology Annotation database (UniProtGOA), the largestcontributor of electronic annotations to GO consortium [33] used exper-imental annotations in the newer annotation releases to confirm orreject electronic annotations from older releases. Thus, for qualitymeasurement, they evaluated the proportion of electronic annotationsthat were confirmed by new experimental annotations (reliability),the power of electronic annotations to predict experimental annota-tions (coverage), and how informative the predicted GO terms are(specificity). They found that the reliability and the specificity of IEAannotations have improved in recent years compared to the annota-tions performed by curators [32]. Furthermore, considering that exper-imental verification would be extremely expensive even for a smallsubset of annotations, this method provides possibilities to identifyingthe subset of electronic annotations that can be considered confidentin a study. The ontology-based functional annotation, therefore, consti-tutes a valuable approach for the proteomics data analysis.

3. Functional enrichment analysis for identification ofoverrepresented biological mechanisms

In proteomic studies, the protein identity constitutes the mostimportant information for further biological annotation of large datasets. Furthermore, proteins identified through shotgun proteomics ap-proaches may be related to a broad diversity of biological functions,whichmayhave a role inmany different biological pathways.Moreover,variations on protein expression levels may give indications aboutalterations of cellular mechanisms, such as changes resulting from thedevelopment of pathological processes. To identify and prioritize theseassociations a wide range of bioinformatics tools have been developedin recent years. Some of these tools use enrichment analysis to identifythe biological information overrepresented in protein lists and alsoprovides network visualization of the biological interactions.

Enrichment analysis based on GO terms allows the functionalclassification and the detection of the most represented biological

annotations of a protein set, based on statistical methods. Thisapproach increases the probability of identifying the most pertinentbiological processes related to a biological mechanism under study.Therefore, it is important to note that if a biological process is alteredin a study, enrichment analysis may point sets of genes that shouldhave a higher (enriched) potential to be selected as a biological rele-vant group by the high-throughput screening technologies. The goalof this approach is to summarize the biological processes and pathwaysthat are most likely related to a given biological condition [34]. To afurther discussion on aspects of ontologies and annotations thatshould be considered when performing GO enrichment analysis seeRhee et al. [29].

The enrichment analysis can be quantitatively measured by statisti-cal methods, including Chi-square, Fisher's exact test, Binomial proba-bility and Hypergeometric distribution [34]. Once the significance testis performed formany groups, amultiple test correctionmust be carriedout in order to limit false positives, commonly known as a type I error.The incidence of false positives is proportional to the number of testsperformed and to the critical significance level (p-value cutoff). Thus,the enrichment p-value scores for the annotations must be correctedwith statistical methods, such as Bonferroni or Benjamini–Hochberg,to adjust the individual p-value for each gene to keep the overall errorrate (or false positive rate) to less than or equal to the p-value cutoff,as set by the user [35]. Several tools are available to accomplishGO-term analysis. An extensive list of GO-related tools can be found athttp://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools.Examples of computational tools for functional annotation ofproteomics data are listed in Table 1.

Huang et al. [34] performed a survey of the principal aspects ofenrichment analysis and identified 68 bioinformatics enrichmenttools, classifying them into three classes, according to the mainalgorithms implemented: Singular Enrichment Analysis (SEA),Gene Set Enrichment Analysis (GSEA), and Modular EnrichmentAnalysis (MEA) [34,36]. Basically, SEA is the most used strategy, inwhich the user selects a list of interesting proteins (e.g. differentiallyexpressed proteins between two experimental conditions) and gen-erates a list of statistically significant overrepresented annotations.Of note, different tools for enrichment analysis may generate differ-ent results on the functional annotation of proteomics data. For com-parative purposes, we performed the enrichment analysis for anexample proteome dataset [8] using six different tools and themost significant overrepresented terms were compared. The toolsBiNGO, DAVID [37], GeneSetDB [38], FuncAssociate [39], GOMiner[40] and BioMart [41] were compared in this study. The results ofthe Top 10 most significant terms are shown in SupplementalTable 2. All six tools are based on SEA algorithms, which is currentlythe most disseminated type of analysis. By using the default param-eter settings offered by each tool for the enrichment analysis, theresulting Top 10 list of the most significant overrepresented GOterms was compared between the tools. We did not observe a 100%similarity between the Top 10 lists. The tools that presented themore similar results were BiNGO [26] and DAVID showing 80% ofsimilar GO terms in the Top 10 list, followed by DAVID and GoMiner,which showed 70% of similarity in the Top 10 lists of overrepresent-ed GO terms. This is not an indication that one tool is better or worsethan others, but it implies that different tools may and will probablygenerate different data outcomeswhichmay thenmodify the biologicalinterpretation of the proteomics data.

Different from the SEA tools, the GSEA algorithms take intoconsideration quantitative experimental data, such as protein foldchange without pre-selection of a candidate protein list. Finally,MEA tools inherit the basic enrichment calculation of SEA strategy,but also consider term–term and gene–gene relationships to determinethe enrichment p-value. Therefore, this classification helps to select thetools that better integrate into the different strategies of proteomicsfunctional annotation.

Page 4: Functional annotation and biological interpretation of proteomics data

Table 1Computational tools for functional annotation.

URL Comments Citation

Gene ontologyAmiGO http://amigo.geneontology.org/cgi-bin/amigo/go.cgi Search engine of Gene Ontology annotation [68]BiNGO http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html Tool for enrichment analysis on Cytoscape [26]DAVID http://david.abcc.ncifcrf.gov/ Meta-tool for functional analysis of large gene lists [69]GeneMANIA http://genemania.org/ Tool to identify the most related genes to a query gene [59]

Pathway analysisKEGG http://www.genome.jp/kegg/ Database resource for pathway analysis [55]Reactome http://www.reactome.org/ Tool for pathway analysis [70]MetaCore http://thomsonreuters.com/metacore/ Commercial tool for pathway analysis and data mining

Interaction networksCytoscape http://www.cytoscape.org/ Open-source software for integration, visualization and

analysis of biological networks[27]

IIS — IntegratedInteractome system

http://bioinfo03.ibi.unicamp.br/lnbio/IIS2/index.php Integrative platform for the annotation, analysis andvisualization of the interaction profiles of proteins/genes,metabolites and drugs of interest

[52]

STRING http://string.embl.de Database tool for direct or indirect protein interactions [71]Enrichr http://amp.pharm.mssm.edu/Enrichr/index.html Integrative software to rank enriched terms, interactive

visualization for enrichment results display[72]

Text miningiHOP http://www.ihop-net.org/ Tool for search textual information from PubMed for

a given gene[73]

Chilibot http://www.chilibot.net/ PubMed literature database about specific relationshipsbetween proteins and genes

[74]

BioText search engine http://biosearch.berkeley.edu/index.php?action=logout Search engine to access scientific literature [75]PPI Finder http://liweilab.genetics.ac.cn/tm/ Human Protein–Protein Interaction Mining Tool [76]PIE the search http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/

PIE/index.htmlTool for searching PubMed literature on proteininteraction information

[77]

49C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

Moreover, a new annotationmethodwas developed for quantitativeproteome analysis, Protein Set Enrichment Analysis (PSEA), to assesspathway level patterns in the expression changes [42]. As amodificationof GSEA, PSEA uses the complete quantitative profiles of all identifiedproteins with no arbitrary cutoffs. A previous study performed thismethod and not only found many confirmatory biological findings forbreast epithelial malignancy, but also revealed significant changesof downstream targets for a number of common transcription factorssuggesting a role for specific gene regulatory pathways in breast tumor-igenesis [42]. Nevertheless, whereas enrichment analysis ranks a list ofannotations based on the p-value scores [34], the researcher must becritical to discriminate terms that are poorly related to the study.There is an important component of overestimation of significantoverrepresented GO terms when using Fisher's Exact Test if the proteinset has a high number of GO annotations. As Fisher's test relies on theassumption that all genes have an equal probability to be selectedunder the null hypothesis, it is expected that a level of false positives,highly overlapping annotations are considered overrepresented in adataset. New approaches, accounting for this implicit bias, have beenrecently proposed, including the Annotation Enrichment Analysiswhich accounts for the non-uniformity of the number of functionsassociated with individual genes and the number of genes annotatedto an individual function [43].

Furthermore, literature analysis using text mining tools for infor-mation retrieval [36,44] may be very useful to help researchers toexplore and access the scientific literature [45] by pre-selectingspecific parameters (e.g., keywords, gene names) for data searchingand analysis. For an outlook on literature retrieval and text mining,see Manconi et al. [46].

4. Integration of functional annotations through biologicalnetwork analysis

In the past years, several models of biological networks havebeen generated by computer science in order to aid visualization ofresults from the simulation of biological systems, such as biochemi-cal reactions [47] and protein interaction [48] networks. The aims

of such models in biological research are to (1) systematicallyinterrogate and experimentally verify knowledge of a pathway,(2) manage the complexity of cellular components and interactions,and (3) provide an outlook of properties and consequences of pathwayalterations [49].

With the large amount of “omics” data generated by experimentaltechnologies the challenge is now to integrate and deeply explore thislarge amount of information in order to provide a global understandingof biological functions and systems. Several tools have been developedin order to process and analyze the resulting large-scale datasets, suchas proteome and transcriptome data. A common computationalapproach includes the integration of overrepresented annotationscalculated through the enrichment analysis, with results usuallydisplayed as a network, where nodes indicate the molecular entities(proteins, transcripts or genes) and edges representing the differenttypes of relationships (e.g. co-expression, co-localization, physicalinteractions) between the nodes or it may represent processes orhierarchical connections between the ontology terms.

Among the programs used for proteomics data interpretation,Cytoscape appears among the most well-known open source softwareplatform which allows visualization of biological networks generatedfrom high-throughput expression data. Several tools (apps) have beenfurther developed to handle biological pathways in combination to theCytoscape visualization structure, allowing integration of networkscontaining functional annotations and gene expression data, proteinabundance levels and other biological information [50]. Thus, thesoftware platform provides features for data visualization andintegration with apps contributed by the community. The list ofapps can be assessed and installed online from Cytoscape App Store(http://apps.cytoscape.org/) or via app manager within Cytoscape.

On this scenario, the applications ClueGO [51] and BiNGO [26]withinCytoscape suite, the Integrated Interactome System (IIS) [52] andMetaCore (MetaCore TM, http://thomsonreuters.com/metacore/)appear as interesting tools and platforms for proteomics functionalannotation analysis. Each of these tools has a distinct set of featuresfor functional annotation analysis of proteomics data and visualizationof biological networks.

Page 5: Functional annotation and biological interpretation of proteomics data

50 C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

However, many software and web tools accept only specific typesof protein identifiers as input data. Protein names do not have standard-ized identifiers and may differ in distinct databases or in updatedversions of the same database. Protein databases such as Uniprot [53]and Ensembl [54] are commonly used in proteomic studies and arevery useful for protein name normalization. However, another possibil-ity to overcome the ambiguity of protein identifiers is to use gene namescorresponding to the proteins of interest. Nevertheless, in some cases,the use of gene names in functional annotation may lead to ambiguousannotations. For a better understanding of bioinformatics servicesfor the normalization of protein and gene identifiers from large-scaleproteomics data sets, see Malik et al. [30]. Once the protein or genelist is normalized and checked for no redundancies, the proteome datacan be analyzed using different software for data interpretation.

The functional interpretation of protein sets by ClueGO app canimprove biological interpretation of large lists of genes using GeneOntology (GO) terms [23] and KEGG/BioCarta pathways [55] for statis-tical analysis, resulting in a functionally organized GO/pathway termnetwork. Additionally to GO term enrichment analysis, ClueGO includesthe possibility to perform a depletion analysis, which can be carried outindividually or together with enrichment test. Additionally, it is alsopossible to selectwhich statistical testwill be applied for p-value correc-tion. For a deeper discussion on the different approaches used to deter-mine the significant enrichments and/or depletions of GO categoriesamong a gene dataset, see Rivals et al. [56].WhenGO terms are selected,users can set the interval of GO hierarchical levels to be considered inthe analysis, which depends if specific or general terms are desired inthe interpretation of the study. Other parameters found in ClueGOinclude the minimal number of genes to consider a term as enrichedin the input list, and also how much these genes must represent(in percentage) from the total of genes annotated for a certainterm. Among its features, it is also possible to analyze one or twogene lists for cluster comparison, with the visualization of the func-tional terms grouped in a network. One may also consider making afusion of related GO terms that have similar sets of associatedgenes in order to diminish the redundancy of the annotation and togrouping the terms based on GO hierarchy or based on Kappa score(kappa statistics of the association strength between the terms).

Here, we have explored the effects of changes of the parametersettings of the enrichment analysis performed by ClueGO app v.2.1.2in the determination of overrepresented GO terms in an exampleproteomics dataset from tumor tissues [8]. For this analysis, clustercomparison mode was selected on in order to visualize the enrichedGO terms in the up- and down-regulated proteins of the dataset. Twointervals of GO hierarchical levels were considered (default intervalof GO levels 3–8 and the broader interval of GO levels from 0 to 20)separately and GO term fusion setting was applied while testing differ-ent minimum and maximum number of genes and gene percentage onthe enrichment analysis. The number of significant (p b 0.05) overrep-resented GO terms generated through the use of ClueGO with GOterm fusion setting on was compared with the results from the analysiswith this setting off (Supplemental Table 3). As expected, a reduction onthe number of terms was observed when applying GO term fusion.Moreover, the resulting list of the Top 10most significant biological pro-cess and cellular component GO terms of these analyses is shown inSupplemental Table 4, while the Top 10 most significant molecularfunction GO terms are shown in Supplemental Table 5. Consideringthe analysis performed using the same intervals of GO hierarchicallevels, only a few GO terms were observed consistently in the Top 10list of overrepresented GO terms, considering the comparison of the re-sults obtained for the different enrichment parameters tested. However,for those terms which appeared in all Top 10 lists, the p-values werevery similar, as expected.

Another Cytoscape app used to determine Gene Ontology (GO)terms overrepresented in a gene set is BiNGO. This plugin maps thepredominant functional GO terms of the tested gene set on the GO

hierarchy and displays the results as a network. An interestingfeature of BiNGO is the possibility to use custom ontologies andannotations. We, thus, explored this capability and performed a GOterm enrichment analysis using a customized reference GO annota-tion file which included the GO annotation of the proteins identifiedin our proteomic example dataset [8]. We compared the results ofthe enrichment analysis performed by BiNGO using our customizedreference GO annotation with the results of the enrichment analysisof GO terms performed with the whole proteome GO annotation fileas reference set. The input data used in this comparison was the listof the gene names of the proteins differentially expressed in ourexample proteomics study. The list of the Top 10 most significantGO terms of the analysis using customized GO annotation file isshown in the Supplemental Table 6. All terms found as overrepre-sented in the analysis with the custom annotation were also foundin the analysis with the whole annotation. However, the p-valuesof the significant overrepresented GO terms were much higherwhen generated through the use of customized GO annotation asreference set. This result was expected since the p-value for theenrichment analysis is calculated based on the considered subset ofproteins. On the other hand, we may imply that the enrichmentanalysis performed with the customized GO annotation indicates,through its p-values, that there seems to be no GO term significantlyenriched in the dataset, contrasting with the very low p-values foundin the enrichment analysis performed with the whole GO annotationas reference set. These findings suggest that high-throughput data,such as from transcriptome and proteome analysis, could be furtherincluded in the process of functional annotation; however, the num-ber of proteins used to build the reference GO annotation file mustbe large enough to represent close to the totality of the proteinsexpressed in a specific moment and experimental condition.

Another app available through Cytoscape, called GeneMANIA, canpredict interactions in which a gene or protein cluster participatesand it generates networks from currently available data [57–59].GeneMANIA creates an interaction network for the query list and,by a guilt-by association approach, integrates similar genes that areidentified usingmore than 800 annotated networks from six organismsavailable on databases [57,60]. Thus, the researchers can generate aninteraction network of proteins based on genetic interactions, sharedprotein domains, pathways, physical interactions, co-localization,co-expression and in silico predictions. GeneMANIA retrieves datafrom different sources, including individual studies and large databasessuch as BIOGRID [61], GEO [62], I2D [63] and Pathway Commons [64].The analysis of a single gene or protein can also be useful, once itwill display which genes/proteins are known to interact to eachother, whichmight help to have an overview of possible interactions.Additionally, for the integration of biological networks and gene func-tion prediction, GeneMANIA can also be applied for gene prioritizationfor functional assays [59].

Further programs, such as the Integrated Interactome System (IIS),have been recently released providing an integrative platform forthe annotation, analysis and visualization of the interaction profiles ofproteins/genes, metabolites and drugs of interest. IIS platform wasbuilt in an easy-to-use web-based interface and it is divided into differ-entmodules, startingwith the input of raw sequencing data or the inputof a list of gene or protein identifiers, followed by the annotation andinteractome modules. Five parameter classes need to be selected bythe user to perform interactome analysis, which includes the organismof study, the network configuration, the score cutoff, the two-hybridparameters and the expression analysis. IIS works with diverseorganism datasets and also offers the possibility to constructnetworks with interactions between different organisms or usingan orthologous relationship, a feature that can be applied as an alterna-tive to analyze data from organisms without complete genome annota-tion by selecting a closely related organism that has already beenannotated. In the expression analysis parameters, the user can set cutoff

Page 6: Functional annotation and biological interpretation of proteomics data

Fig. 2. Interaction network of the proteins identified as differentially expressed in tumor tissues. The protein–protein networkwas built using IIS software and visualized using Cytoscapesoftware for a previous proteomics dataset (Simabuco et al., 2014). (A) The enriched biological process (p ≤ 0.05) or (B) enriched KEGG pathways (p ≤ 0.05) among the up-regulatedproteins (red), down-regulated proteins (green) and background intermediary proteins (gray) from IIS database are depicted in the network by clustering the proteins involved ineach of the biological process/pathways with a circle layout. Proteins belonging to more than one process or pathway were assigned to the one with the best enrichment p-values. Thenode sizes of up-, down- and non-regulated proteins are depicted proportional to their fold change expression values.

51C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

Page 7: Functional annotation and biological interpretation of proteomics data

52 C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

values according to fold change in expression/concentration levels onthe dataset, which is used to define the input node sizes and to colorthe nodes according to the up- or down-regulation of the proteinexpression. Regarding the enrichment analysis, the program uses GObiological processes and KEGG pathways to calculate the enrichmentin the generated network. For a detailed description of statisticalparameters available on IIS, please see [52]. Thus, we explored theapplicability of IIS and uploaded in the IIS first module our exampleproteome dataset [8] as a text file composed of UniProt IDs and thecorresponding fold change values of relative expression, in order toget a protein–protein network interaction. The resulting network wasthen visualized using Cytoscape 3.0.2 and significantly enriched KEGGpathways or GO biological process were assigned as clusters (Fig. 2).This network arrangement is useful to visually detect which proteins,here represented by nodes, are participating in a certain biologicalprocess or pathway. Despite the lower amount of proteins annotatedby KEGG pathways, several proteins from the input list were foundenriched in the dataset based on GO biological process category.Furthermore, based on database interactions, IIS networks provideinformation of proteins that might be interacting with the input list,which may contribute to a better understanding of the biological roleof the proteins studied.

Besides the public and open source programs such as Cytoscape andIIS, commercial tools are also available for proteomics data interpreta-tion. Among the tools broadly used for proteomics data interpretationis MetaCore. This program was developed to attend a broad range ofbiological questions, offering pathway analysis, knowledge mining,model disease pathway, and target and biomarker assessment. Enrich-ment analysis of one or more protein sets at once can be performedon MetaCore, whose results include pathway maps, process networks,diseases (based on biomarkers), metabolic networks and GO categoriesrelated to the input protein sets.

Our example proteome dataset was uploaded onto MetaCore andenrichment analysis was performed. A summary of the results wasexported and saved as a detailed reportwith the Top 10most significantpathway maps, process networks, diseases (by biomarkers), metabolicnetworks and GO processes (Supplemental Tables 7–12). No cancer-associated disease was found to be enriched, but the results showedpossible metabolic alterations during the tumor development,as pointed by pathway maps and GO processes. Landi et al. usedMetaCore and DAVID [34,65] softwares for the biological interpretationof the proteomics data and the results provided protein-interactionnetwork maps and insights into biological responses and potentialpathways on bronchoalveolar fluid [66]. Chen et al. identified potentialbladder cancer markers through the use of MetaCore for the functionalannotation of the proteomics alterations observed and caused by thedisease [67].

Functional enrichment analysis is a common start in analyzingproteomics data from one or more datasets, followed by networkprotein interactions and pathway analysis, which are now essentialto extract meaningful biological information from proteomics data.Therefore, the interpretation of protein lists is not limited to the sim-ple identification of proteins, and different software can providemeaningful biological insights on cellular processes, related diseases,biomarker candidates, among other features. However, one canencounter dissimilar results on functional annotation when usingdifferent software and approaches, which may generate multiplenetworks of interactions or pathways after performing enrichmentanalysis of a gene or gene lists. For instance, a gene that is annotatedand well described in the literature for a certain function might alsobe found in a new study as involved in a previous unknown function.Thus, the results from a functional annotation may not include thispossible role and may poorly contribute to the understanding ofsuch association, once network interactions and pathways are gener-ally describing common process and functions, even when usingdifferent tools for the sake of complementarity. Therefore, manual

curation and consulting the recent literature are actions that stillneed to be done in order to correctly assign proper knowledge tothe proteomics functional annotation analysis. The development ofnew tools for dynamic integration of updated knowledge is indeednecessary and may contribute effectively to improve functionalannotation of proteomics data.

5. Conclusions

Datasets generated in proteomics experiments are usuallylarge lists of protein identifications. However, the extraction ofbiological meaning of these large datasets must be performedthrough functional annotation.

Furthermore, pathway analysis and generation of interactionnetworks based on previous data are fundamental in the visualizationand interpretation of biological processes involved in the conditionsstudied. Over the last years, bioinformatics tools were developed togathering the biological knowledge accumulated in public databasesfacilitating the analysis of large gene or protein lists, contributingto the generation of new hypothesis through significant biologicalinformation. In the present work we reviewed the application of somecomputational tools that can be used for the interpretation of prote-omics data regarding functional annotation through enrichmentanalysis methods, biological network generation and visualizationof biological data. Some of the programs and tools discussed herewere tested with the analysis of an example proteomics datasetand can easily be applied to other protein or gene sets. These toolscan greatly assist researchers to interpret their proteomics datausing different ontology annotations and biological knowledge,in order to reveal the identity of key biological mechanisms.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.bbapap.2014.10.019.

Acknowledgments

This work was supported by FAPESP grants: 2009/54067-3, 2010/19278-0, 2009/52833-0 and CNPq grants: 470549/2011-4, 301702/2011-0 and 470268/2013-1 to AFPL and a CAPES fellowship to C.M.C.

References

[1] M.H. Elliott, D.S. Smith, C.E. Parker, C. Borchers, Current trends in quantitativeproteomics, J. Mass Spectrom. 44 (2009) 1637–1660.

[2] S.E. Ong, M. Mann, Mass spectrometry-based proteomics turns quantitative, Nat.Chem. Biol. 1 (2005) 252–262.

[3] S. Gupta, A. Venkatesh, S. Ray, S. Srivastava, Challenges and prospects for biomarkerresearch: a current perspective from the developing world, Biochim. Biophys. Acta1844 (2014) 899–908.

[4] B. Pesch, T. Bruning, G. Johnen, S. Casjens, N. Bonberg, D. Taeger, A. Muller, D.G.Weber, T. Behrens, Biomarker research with prospective study designs for theearly detection of cancer, Biochim. Biophys. Acta 1844 (2014) 874–883.

[5] B.Macek,M.Mann, J.V. Olsen, Global and site-specific quantitative phosphoproteomics:principles and applications, Annu. Rev. Pharmacol. Toxicol. 49 (2009) 199–221.

[6] F.V. Winck, M. Belloni, B.A. Pauletti, L. Zanella Jde, R.R. Domingues, N.E. Sherman,A.F. Paes Leme, Phosphoproteome analysis reveals differences in phosphositeprofiles between tumorigenic and non-tumorigenic epithelial cells, J. Proteomics96 (2014) 67–81.

[7] F. Di Domenico, R. Sultana, E. Barone, M. Perluigi, C. Cini, C. Mancuso, J. Cai, W.M.Pierce, D.A. Butterfield, Quantitative proteomics analysis of phosphorylatedproteins in the hippocampus of Alzheimer's disease subjects, J. Proteomics 74(2011) 1091–1103.

[8] F.M. Simabuco, R. Kawahara, S. Yokoo, D.C. Granato, L. Miguel, M. Agostini, A.Z.Aragao, R.R. Domingues, I.L. Flores, C.C. Macedo, R. Della Coletta, E. Graner, A.F.Paes Leme, ADAM17 mediates OSCC development in an orthotopic murine model,Mol. Cancer 13 (2014) 24.

[9] J.L. Paltridge, L. Belle, Y. Khew-Goodall, The secretome in cancer progression,Biochim. Biophys. Acta 1834 (2013) 2233–2241.

[10] C. Kumar, M. Mann, Bioinformatics analysis of mass spectrometry-based proteomicsdata sets, FEBS Lett. 583 (2009) 1703–1712.

[11] A. Valencia, Automatic annotation of protein function, Curr. Opin. Struct. Biol. 15(2005) 267–274.

[12] B. Rost, J. Liu, R. Nair, K.O. Wrzeszczynski, Y. Ofran, Automatic prediction of proteinfunction, Cell. Mol. Life Sci. 60 (2003) 2637–2650.

Page 8: Functional annotation and biological interpretation of proteomics data

53C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

[13] Y. Loewenstein, D. Raimondo, O.C. Redfern, J. Watson, D. Frishman, M. Linial, C.Orengo, J. Thornton, A. Tramontano, Protein function annotation by homology-based inference, Genome Biol. 10 (2009) 207.

[14] A.S. Juncker, L.J. Jensen, A. Pierleoni, A. Bernsel, M.L. Tress, P. Bork, G. von Heijne, A.Valencia, C.A. Ouzounis, R. Casadio, S. Brunak, Sequence-based feature predictionand annotation of proteins, Genome Biol. 10 (2009) 206.

[15] M.R. Brent, Genome annotation past, present, and future: how to define an ORF ateach locus, Genome Res. 15 (2005) 1777–1786.

[16] M. Yandell, D. Ence, A beginner's guide to eukaryotic genome annotation, Nat. Rev.Genet. 13 (2012) 329–342.

[17] L. Stein, Genome annotation: from sequence to biology, Nat. Rev. Genet. 2 (2001)493–503.

[18] T. Gruber, Towards principles for the design of ontologies used for knowledgesharing, Int. J. Hum. Comput. Stud. 43 (1993) 907–928.

[19] J.B. Bard, S.Y. Rhee, Ontologies in biology: design, applications and future challenges,Nat. Rev. Genet. 5 (2004) 213–222.

[20] M. Gan, X. Dou, R. Jiang, From ontology to semantic similarity: calculation ofontology-based semantic similarity, ScientificWorldJournal 2013 (2013) 793091.

[21] N.F. Noy, N.H. Shah, P.L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D.L. Rubin,M.A. Storey, C.G. Chute, M.A. Musen, BioPortal: ontologies and integrated dataresources at the click of a mouse, Nucleic Acids Res. 37 (2009) W170–W173.

[22] S. Schulze-Kremer, Ontologies for molecular biology and bioinformatics, In SilicoBiol. 2 (2002) 179–193.

[23] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K.Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S.Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Geneontology: tool for the unification of biology. The Gene Ontology Consortium, Nat.Genet. 25 (2000) 25–29.

[24] J. Gillis, P. Pavlidis, Assessing identity, redundancy and confounds in Gene Ontologyannotations over time, Bioinformatics 29 (2013) 476–482.

[25] A. Gross, M. Hartung, K. Prufer, J. Kelso, E. Rahm, Impact of ontology evolution onfunctional analyses, Bioinformatics 28 (2012) 2671–2677.

[26] S. Maere, K. Heymans, M. Kuiper, BiNGO: a Cytoscape plugin to assess overrepresen-tation of gene ontology categories in biological networks, Bioinformatics 21 (2005)3448–3449.

[27] R. Saito, M.E. Smoot, K. Ono, J. Ruscheinski, P.L. Wang, S. Lotia, A.R. Pico, G.D. Bader, T.Ideker, A travel guide to Cytoscape plugins, Nat. Methods 9 (2012) 1069–1076.

[28] M. Hartung, A. Gross, E. Rahm, CODEX: exploration of semantic changes betweenontology versions, Bioinformatics 28 (2012) 895–896.

[29] S.Y. Rhee, V. Wood, K. Dolinski, S. Draghici, Use and misuse of the gene ontologyannotations, Nat. Rev. Genet. 9 (2008) 509–515.

[30] R. Malik, K. Dulla, E.A. Nigg, R. Korner, From proteome lists to biological impact—tools and strategies for the analysis of large MS data sets, Proteomics 10 (2010)1270–1283.

[31] Creating the gene ontology resource: design and implementation, Genome Res. 11(2001) 1425–1433.

[32] N. Skunca, A. Altenhoff, C. Dessimoz, Quality of computationally inferred geneontology annotations, PLoS Comput. Biol. 8 (2012) e1002533.

[33] D. Barrell, E. Dimmer, R.P. Huntley, D. Binns, C. O'Donovan, R. Apweiler, The GOAdatabase in 2009—an integrated Gene Ontology Annotation resource, NucleicAcids Res. 37 (2009) D396–D403.

[34] W. Huang da, B.T. Sherman, R.A. Lempicki, Bioinformatics enrichment tools: pathstoward the comprehensive functional analysis of large gene lists, Nucleic AcidsRes. 37 (2009) 1–13.

[35] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical andpowerful approach to multiple testing, J. R. Stat. Soc. 57 (1995) 289–300.

[36] M. Krallinger, A. Valencia, Text-mining and information-retrieval services formolecular biology, Genome Biol. 6 (2005) 224.

[37] G. Dennis Jr., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki,DAVID: Database for Annotation, Visualization, and Integrated Discovery, GenomeBiol. 4 (2003) P3.

[38] H. Araki, C. Knapp, P. Tsai, C. Print, GeneSetDB: a comprehensive meta-database,statistical and visualisation framework for gene set analysis, FEBS Open Biol. 2(2012) 76–82.

[39] G.F. Berriz, O.D. King, B. Bryant, C. Sander, F.P. Roth, Characterizing gene sets withFuncAssociate, Bioinformatics 19 (2003) 2502–2504.

[40] B.R. Zeeberg, W. Feng, G. Wang, M.D. Wang, A.T. Fojo, M. Sunshine, S. Narasimhan,D.W. Kane, W.C. Reinhold, S. Lababidi, K.J. Bussey, J. Riss, J.C. Barrett, J.N.Weinstein, GoMiner: a resource for biological interpretation of genomic andproteomic data, Genome Biol. 4 (2003) R28.

[41] A. Kasprzyk, BioMart: driving a paradigm change in biological data management,Database (Oxford) (2011) (2011) bar049.

[42] S. Cha, M.B. Imielinski, T. Rejtar, E.A. Richardson, D. Thakur, D.C. Sgroi, B.L. Karger,In situ proteomic analysis of human breast cancer epithelial cells using laser capturemicrodissection: annotation by protein set enrichment analysis and gene ontology,Mol. Cell. Proteomics 9 (2010) 2529–2544.

[43] K. Glass, M. Girvan, Annotation enrichment analysis: an alternative method forevaluating the functional properties of gene sets, Sci. Rep. 4 (2014) 4191.

[44] J.J. Kim, D. Rebholz-Schuhmann, Categorization of services for seeking informationin biomedical literature: a typology for improvement of practice, Brief. Bioinform.9 (2008) 452–465.

[45] K.B. Cohen, L. Hunter, Getting started in text mining, PLoS Comput. Biol. 4(2008) e20.

[46] A. Manconi, E. Vargiu, G. Armano, L. Milanesi, Literature retrieval and miningin bioinformatics: state of the art and challenges, Adv. Bioinform. 2012(2012) 573846.

[47] W.W. Chen, M. Niepel, P.K. Sorger, Classic and contemporary approaches tomodeling biochemical reactions, Genes Dev. 24 (2010) 1861–1875.

[48] P. Aloy, R.B. Russell, Structural systems biology: modelling protein interactions, Nat.Rev. Mol. Cell Biol. 7 (2006) 188–197.

[49] P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B.Schwikowski, T. Ideker, Cytoscape: a software environment for integrated modelsof biomolecular interaction networks, Genome Res. 13 (2003) 2498–2504.

[50] M.E. Smoot, K. Ono, J. Ruscheinski, P.L. Wang, T. Ideker, Cytoscape 2.8: newfeatures for data integration and network visualization, Bioinformatics 27(2011) 431–432.

[51] G. Bindea, B. Mlecnik, H. Hackl, P. Charoentong, M. Tosolini, A. Kirilovsky, W.H.Fridman, F. Pages, Z. Trajanoski, J. Galon, ClueGO: a Cytoscape plug-in to decipherfunctionally grouped gene ontology and pathway annotation networks,Bioinformatics 25 (2009) 1091–1093.

[52] M.F. Carazzolle, L.M. de Carvalho, H.H. Slepicka, R.O. Vidal, G.A. Pereira, J. Kobarg,G.V. Meirelles, IIS—Integrated Interactome System: a web-based platformfor the annotation, analysis and visualization of protein–metabolite–gene–drug interactions by integrating a variety of data sources and tools, PLoS One 9(2014) e100385.

[53] The Universal Protein Resource (UniProt), Nucleic Acids Res. 35 (2007) D193–D197.[54] T.J. Hubbard, B.L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P.

Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L.Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N.Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D.Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios,M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella,J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R.Durbin, X.M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S.Searle, P. Flicek, Ensembl 2009, Nucleic Acids Res. 37 (2009) D690–D697.

[55] M. Kanehisa, S. Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes, NucleicAcids Res. 28 (2000) 27–30.

[56] I. Rivals, L. Personnaz, L. Taing, M.C. Potier, Enrichment or depletion of a GO categorywithin a class of genes: which test? Bioinformatics 23 (2007) 401–407.

[57] J. Montojo, K. Zuberi, H. Rodriguez, F. Kazi, G. Wright, S.L. Donaldson, Q. Morris, G.D.Bader, GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop,Bioinformatics 26 (2010) 2927–2928.

[58] S. Mostafavi, D. Ray, D.Warde-Farley, C. Grouios, Q. Morris, GeneMANIA: a real-timemultiple association network integration algorithm for predicting gene function,Genome Biol. 9 (Suppl. 1) (2008) S4.

[59] D. Warde-Farley, S.L. Donaldson, O. Comes, K. Zuberi, R. Badrawi, P. Chao, M. Franz,C. Grouios, F. Kazi, C.T. Lopes, A. Maitland, S. Mostafavi, J. Montojo, Q. Shao, G.Wright, G.D. Bader, Q.Morris, The GeneMANIA prediction server: biological networkintegration for gene prioritization and predicting gene function, Nucleic Acids Res.38 (2010) W214–W220.

[60] W. Tian, L.V. Zhang, M. Tasan, F.D. Gibbons, O.D. King, J. Park, Z. Wunderlich, J.M.Cherry, F.P. Roth, Combining guilt-by-association and guilt-by-profiling to predictSaccharomyces cerevisiae gene function, Genome Biol. 9 (Suppl. 1) (2008) S7.

[61] B.J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R.Oughtred, D.H. Lackner, J. Bahler, V. Wood, K. Dolinski, M. Tyers, The BioGRIDinteraction database: 2008 update, Nucleic Acids Res. 36 (2008) D637–D640.

[62] T. Barrett, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A.Marshall, K.H. Phillippy, P.M. Sherman, M. Holko, A. Yefanov, H. Lee, N. Zhang, C.L.Robertson,N. Serova, S. Davis, A. Soboleva, NCBIGEO: archive for functional genomicsdata sets—update, Nucleic Acids Res. 41 (2013) D991–D995.

[63] K.R. Brown, I. Jurisica, Online predicted human interaction database, Bioinformatics21 (2005) 2076–2082.

[64] PathwayCommons, http://www.pathwaycommons.org (Visited on November2013, in).

[65] W. Huang da, B.T. Sherman, Q. Tan, J.R. Collins, W.G. Alvord, J. Roayaei, R. Stephens,M.W. Baseler, H.C. Lane, R.A. Lempicki, The DAVID Gene Functional ClassificationTool: a novel biological module-centric algorithm to functionally analyze largegene lists, Genome Biol. 8 (2007) R183.

[66] C. Landi, E. Bargagli, L. Bianchi, A. Gagliardi, A. Carleo, D. Bennett, M.G. Perari, A.Armini, A. Prasse, P. Rottoli, L. Bini, Towards a functional proteomics approach tothe comprehension of idiopathic pulmonary fibrosis, sarcoidosis, systemic sclerosisand pulmonary Langerhans cell histiocytosis, J. Proteomics 83 (2013) 60–75.

[67] C.L. Chen, T.S. Lin, C.H. Tsai, C.C. Wu, T. Chung, K.Y. Chien, M. Wu, Y.S. Chang, J.S.Yu, Y.T. Chen, Identification of potential bladder cancer markers in urine byabundant-protein depletion coupled with quantitative proteomics, J. Proteomics85 (2013) 28–43.

[68] S. Carbon, A. Ireland, C.J. Mungall, S. Shu, B. Marshall, S. Lewis, AmiGO: online accessto ontology and annotation data, Bioinformatics 25 (2009) 288–289.

[69] W. Huang da, B.T. Sherman, R.A. Lempicki, Systematic and integrative analysis oflarge gene lists using DAVID bioinformatics resources, Nat. Protoc. 4 (2009) 44–57.

[70] D. Croft, A.F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M.Gillespie, M.R. Kamdar, B. Jassal, S. Jupe, L. Matthews, B. May, S. Palatnik, K. Rothfels,V. Shamovsky, H. Song, M.Williams, E. Birney, H. Hermjakob, L. Stein, P. D'Eustachio,The Reactome pathway knowledgebase, Nucleic Acids Res. 42 (2014) D472–D477.

[71] B. Snel, G. Lehmann, P. Bork, M.A. Huynen, STRING: a web-server to retrieve anddisplay the repeatedly occurring neighbourhood of a gene, Nucleic Acids Res. 28(2000) 3442–3444.

[72] E.Y. Chen, C.M. Tan, Y. Kou, Q. Duan, Z. Wang, G.V. Meirelles, N.R. Clark, A. Ma'ayan,Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool,BMC Bioinformatics 14 (2013) 128.

[73] R. Hoffmann, A. Valencia, A gene network for navigating the literature, Nat. Genet.36 (2004) 664.

Page 9: Functional annotation and biological interpretation of proteomics data

54 C.M. Carnielli et al. / Biochimica et Biophysica Acta 1854 (2015) 46–54

[74] H. Chen, B.M. Sharp, Content-rich biological network constructed by miningPubMed abstracts, BMC Bioinformatics 5 (2004) 147.

[75] M.A. Hearst, A. Divoli, H. Guturu, A. Ksikes, P. Nakov, M.A. Wooldridge, J. Ye, BioTextSearch Engine: beyond abstract search, Bioinformatics 23 (2007) 2196–2197.

[76] M.He, Y.Wang,W. Li, PPIfinder: amining tool for humanprotein–protein interactions,PLoS One 4 (2009) e4554.

[77] S. Kim, D. Kwon, S.Y. Shin, W.J. Wilbur, PIE the search: searching PubMed literaturefor protein interaction information, Bioinformatics 28 (2012) 597–598.