BIIT - g:Profiler: a web server for functional enrichment analysis … · 2020. 10. 22. · 4 NucleicAcidsResearch,2019 PythonpackagesutilizethenewJSONAPIsthatnowoffer amorestandardizedwaytoaccessg:Profilerprogrammati-cally.Moreover

Nucleic Acids Research 2019 1doi 101093nargkz369

gProfiler a web server for functional enrichmentanalysis and conversions of gene lists (2019 update)Uku Raudvere 1dagger Liis Kolberg 1dagger Ivan Kuzmin 1 Tambet Arak1 Priit Adler12Hedi Peterson 12 and Jaak Vilo 123

1Institute of Computer Science University of Tartu J Liivi 2 50409 Tartu Estonia 2Quretec Ltd Ulikooli 6a 51003Tartu Estonia and 3Software Technology and Applications Competence Centre Ulikooli 2 51003 Tartu Estonia

Received February 27 2019 Revised April 07 2019 Editorial Decision April 16 2019 Accepted April 29 2019

ABSTRACT

Biological data analysis often deals with lists ofgenes arising from various studies The gProfilertoolset is widely used for finding biological cate-gories enriched in gene lists conversions betweengene identifiers and mappings to their orthologs Themission of gProfiler is to provide a reliable servicebased on up-to-date high quality data in a conve-nient manner across many evidence types identi-fier spaces and organisms gProfiler relies on En-sembl as a primary data source and follows theirquarterly release cycle while updating the other datasources simultaneously The current update providesa better user experience due to a modern responsiveweb interface standardised API and libraries Theresults are delivered through an interactive and con-figurable web design Results can be downloadedas publication ready visualisations or delimited textfiles In the current update we have extended thesupport to 467 species and strains including verte-brates plants fungi insects and parasites By sup-porting user uploaded custom GMT files gProfileris now capable of analysing data from any organ-ism All past releases are maintained for reproducibil-ity and transparency The 2019 update introducesan extensive technical rewrite making the servicesfaster and more flexible gProfiler is freely availableat httpsbiitcsuteegprofiler

INTRODUCTION

Interpretation of gene lists from high-throughput studiesneeds capable and convenient tools based on most up-to-date data There are a several functional enrichment anal-ysis tools such as Enrichr (1) WebGestalt (2) Metascape(3) KOBAS (4) and AgriGO (5) suitable for positioning the

novel findings against the body of previous knowledge Thelandscape of enrichment analysis tools is diverse coveringdifferent data sources species identifier types and methodsWhile the majority of services provide mappings to the mostwidely used knowledge resource Gene Ontology (GO) (6)the selection of other data sources varies between the toolsFor example Human Phenotype Ontology (7) is availablein Enrichr WebGestalt Metascape and gProfiler (1ndash38)while mirTarBase miRNA target information is includedonly in a few tools eg Enrichr and gProfiler (18) Thereare also services that focus on specific species eg AgriGOprovides data mostly about plants (5)

These tools have been implemented in a variety of tech-nical platforms For example WebGestalt has a well knownweb server (2) GSEA is known for its stand-alone applica-tion (9) Enrichr in addition to web service also has an Rpackage (1) Other tools serve its users across a variety oftechnical platforms For example gProfiler serves its usersvia a web client and API Python and R packages and isavailable as a tool for Galaxy platform (10)

The input gene lists of functional enrichment tools orig-inate from a broad range of experimental platforms eachhaving unique identifier types supported by default Mostof the tools accept only a limited subset of possible identi-fiers thus creating an obstacle that the users need to over-come via external tools This common hurdle is avoided ingProfiler by automatically detecting and accepting close toa hundred different identifier types possibly mixed togetherin the same query This name mapping functionality is pro-vided also as an independent gConvert service that has al-ready been incorporated as an interoperability function toseveral tools (11ndash13)

The methods used for enrichment analysis vary acrossdifferent tools gProfiler likewise to Enrichr and We-bGestalt (12) provides the most widely used over-representation analysis approach that uses the hypergeo-metric test to measure the significance of functional term inthe input gene list There exist tools that provide other meth-ods that also take into account additional ranking infor-

To whom correspondence should be addressed Tel +372 737 5483 Email vilouteeCorrespondence may also be addressed to Hedi Peterson Email hedipetersonuteedaggerThe authors wish it to be known that in their opinion the first two authors should be regarded as Joint First Authors

Ccopy The Author(s) 2019 Published by Oxford University Press on behalf of Nucleic Acids ResearchThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

Dow

nloaded from httpsacadem

icoupcomnaradvance-article-abstractdoi101093nargkz3695486750 by ELN

ET Group Account user on 12 June 2019

2 Nucleic Acids Research 2019

mation of gene lists (WebGestalt GSEA (29)) or use priorknowledge from gene regulation networks (WebGestalt (2))All of these methods have their own limitations and thereare no good benchmark data to assess and compare differ-ent methods (14) In order to serve the users an easy-to-useand fast enrichment tool gProfiler has been focusing onone approach only

Only very few of the tools dedicated to enrichment anal-ysis have offered continuous and up-to-date service aftertheir initial release gProfiler has remained vital since itsfirst publication in the 2007 NAR web server issue and haspublished update articles in 2011 and 2016 (81516) In or-der to continuously support researchers from a variety ofdifferent scientific domains we have increased the numberof supported species and gene identifier types while keep-ing the data update frequency programmable access andcore high quality data sources in steady manner through-out the years (Figure 1) As the complexity and size of theunderlying data has been growing we have now introduceda complete technical rewrite of the gProfiler This allowsus to serve our users faster and more conveniently throughmodern user interface and programming interfaces as wellas opens up new avenues for adding features and maintain-ing the stable service

GPROFILER WEB SERVER

gProfiler (httpsbiitcsuteegprofiler) is a collection oftools that are commonly used in standard pipelines ofbiological entity (geneprotein) centered computationalanalysis gGOSt performs the functional enrichmentanalysis of individual or multiple gene lists gConvertmaps geneprotein identifiers between various namespacesgOrth allows to map orthologous genes across species andgSNPense maps human SNP identifiers to genes

gGOSt ndash functional enrichment analysis

gGOSt is the core tool for performing functional enrich-ment analysis on input gene list It maps a user providedlist of genes to known functional information sources anddetects statistically significantly enriched biological pro-cesses pathways regulatory motifs and protein complexesMain source of information about genes identifier typesand GO terms and associations is the Ensembl database(17) The Gene Ontology is the most comprehensive re-source in gGOSt covering 467 species and strains repre-sented in Ensembl Genomes and fungi plants or metazoaspecific versions of Ensembl (17) and parasite specific datafrom WormBase (18)

We include most common data sources that are also regu-larly updated This covers currently KEGG (19) Reactome(20) and WikiPathways (21) miRNA targets from miRTar-Base (22) and regulatory motif matches from TRANSFAC(23) tissue specificity based on expression data from Hu-man Protein Atlas (24) protein complexes data from CO-RUM (25) and human disease phenotype associations fromHuman Phenotype Ontology (7) Users are free to chooseany combination of the provided data sources This allowsusers for example to focus either on cellular componentsfrom the Gene Ontology or disease information from the

Human Phenotype Ontology For all data sources exceptfor two (KEGG and Transfac due to the licensing reasons)users can also download all the underlying terms and asso-ciations

The functional enrichment of the input gene list is evalu-ated using the well-proven cumulative hypergeometric testFor a single gene list we are testing numerous functionalterms at a time For example already gt16 000 GO biolog-ical process terms are considered for a human gene list Toreduce the amount of false positive findings gGOSt per-forms multiple testing correction By default we use theoriginal gSCS (Set Counts and Sizes) correction methodintroduced back in 2007 (15) Unlike the standard meth-ods gSCS considers the dependency of multiple tests bytaking into account the overlap of functional terms Werepeated the simulations for verifying the gSCS methodand confirmed that the method still holds the same prop-erties Namely gSCS is more conservative than Benjamini-Hochberg False Discovery Rate but not as strict as Bonfer-roni correction Instead of the default gSCS method usercan also choose to apply the Bonferroni correction or theBenjaminindashHochberg False Discovery Rate It is notewor-thy that in gGOSt we report only the adjusted enrichmentP-values

The input gene list can be either unordered or orderedBy default the list is considered unordered and complete Ifthere exists some natural ranking of the genes the orderedgene list option can be used For example the fold changein between different biological conditions the number ofneighboring nodes in a protein-protein interaction networkthe absolute expression values or any other information rel-evant for the experimental setup could be a good basis forsuch ranking In such cases the hypergeometric testing isperformed for each possible prefix of the list starting fromthe first gene and adding next genes one by one The smallestenrichment P-value is reported for each of the terms alongwith corresponding gene list length For different terms thislength can vary especially as the broader terms can be en-riched for larger lists only

The input can also consist of multiple lists simultaneouslyusing a FASTA format like symbol in front of every listThis multi-list analysis feature was previously implementedin the gCocoa tool

By default gGOSt uses the set of all annotated protein-coding genes as a background In some experiments how-ever only a subset of genes or proteins is measured For ex-ample the targeted sequencing of only disease specific geneswould imply enrichment of that disease association Sta-tistically for these cases it is recommended and sometimesnecessary to use custom background information when cal-culating the statistical enrichment significance The custombackground should include a list of genes that were actu-ally measured during the biological experiment such as allgenes in the sequencing panel This option allows to calcu-late a more precise evaluation of functional enrichment

The enrichment results in gGOSt are first highlighted ina novel Manhattan plot It is accompanied by a more exten-sive and interactive results table giving detailed informationabout every gene and term Both of these outputs are cus-tomizable and downloadable in a publication ready format

Dow




Nucleic Acids Research 2019 3

0

100

200

300

400

r107

0_e6

4_eg

11

r107

5_e6

5_eg

12

r113

7_e6

8_eg

15

r117

7_e6

8_eg

15

r118

5_e6

9_eg

16

r122

7_e7

2_eg

19

r127

0_e7

5_eg

22

r135

3_e7

8_eg

25

r139

5_e7

9_eg

26

r143

5_e8

0_eg

27

r144

0_e8

1_eg

28

r147

7_e8

2_eg

29

r148

8_e8

3_eg

30

r153

6_e8

3_eg

30

r161

5_e8

4_eg

31

r166

5_e8

5_eg

32

r170

3_e8

6_eg

33

r170

9_e8

7_eg

34

r173

0_e8

8_eg

35

r173

2_e8

9_eg

36

r174

1_e9

0_eg

37

r175

0_e9

1_eg

38

r176

0_e9

3_eg

40

e94_eg41_p11_14a0e05

gProfiler version

Am

ount

Number of available speciesstrains across gProfiler versions

Figure 1 The growth of available speciesstrains in gProfiler since 2012 Each column represents an archive version the current version is highlighted inblue The isolated peak in 1185 e69 eg16 refers to a version where we incorporated bacterial species to gProfiler but due to differences in the underlyingdatabase structure we dropped them again from further releases

gConvert ndash automatic conversion of gene identifiers

gConvert is able to convert between various gene proteinmicroarray probe and numerous other types of namespacesWe provide at least 40 types of IDs for more than 60 speciesthat we obtain from Ensembl Biomart (26) The most com-plex are human data where 98 different identifier typesare included currently All identifiers are obtained throughmatching them via Ensembl gene identifiers (ENSG) as areference gConvert also accepts mixed identifier types asinput

The output of gConvert is a table with converted IDsgene names and corresponding gene descriptions Users canalso retrieve the list of genes associated to any terms used ingGOSt (such as Reactome pathways HPA or GO biolog-ical process category) Other services such as the EnsemblBiomart (26) or Uniprot mapping service (27) include sim-ilar range of identifiers However gConvert accepts a mix-ture of identifiers and does not demand specifying originalidentifier type first by the user This improves the interop-erability and helps to connect external services beyond thegProfiler toolset

gOrth ndash mapping orthologous genes across species

gOrth makes use of orthologous gene mapping informa-tion from the Ensembl database It enables the user to re-trieve orthologous genes corresponding to their input genelist automatically The mapping is executed in two-stepmanner by first converting the input gene IDs provided bythe user to Ensembl ENSG identifiers and then retrieving

the corresponding orthologous gene information for tar-get species gOrth can be used to transfer the knowledgecollected for well studied model organisms to less studiedspecies Conducting enrichment analysis after orthologousmapping might lead to more comprehensible results than itwould be possible when using only the original species

gSNPense ndash SNP identifier mapping

gSNPense allows the user to easily map a list of humanSNP rs-codes (eg rs7961894) to gene names receive chro-mosomal coordinates and predicted variant effects Map-ping is enabled for such variants that overlap with at leastone protein coding Ensembl gene All underlying data areretrieved from the Ensembl Variation Data (28) The vari-ant effects are described with color-coded set of variant con-sequences terms defined by the Sequence Ontology (29)These terms convey the information about the effects thateach allele of the variant may have on each gene

Programmatic access

All the tools in gProfiler web server are accessible in GNUR and Python via dedicated software packages gprofiler2and gprofiler-official respectively These packages enablethe community to integrate gProfiler tools to different au-tomated pipelines or to easily access the results for othercustom visualizations Using these packages one can alsoimprove the reproducibility of the analyses as the parame-ters will be self-documented in the code The updated R and

Dow





Python packages utilize the new JSON APIs that now offera more standardized way to access gProfiler programmati-cally Moreover the new R package gprofiler2 provides thesame interactive visualizations as the ones available in theweb tool

Data and software archives

gProfiler follows a quarterly data release cycle of Ensemblwith up to a couple of months lag to update and verifyall our underlying datasets We support data sources thatare well established and continuously updated For the last8 years we have kept the complete archive of previous datareleases and respective software developed up until that mo-ment This allows users to repeat and validate the analyseswithout worrying about the changes in data or software thathave taken place over the time During these 8 years we havepublished 27 database versions and moved from supportingEnsembl version 62 to Ensembl version 94 (see Figure 1)

NEW DEVELOPMENTS IN GPROFILER IN 2019

With this update article we have introduced a completerewrite of the established service This brings faster andmore customizable back-end and modern user interface(Supplementary Figure S1)

Publication ready results

The highlight of the new release is an interactive Manhattanplot well known from GWAS studies and used for the firsttime for enrichment analysis according to our knowledgegGOSt returns this plot illustrating the enriched termsacross all the analysed term categories (Figure 2) The x-axisshows the functional terms and the corresponding enrich-ment P-values in negative log10 scale are illustrated on they-axis Each circle on the plot represents a single functionalterm The circles are color-coded by data source and size-scaled according to the number of annotated genes in thatterm The lighter circles represent insignificant terms andthe data sources that were not analysed (omitted by user)are in grey (Figure 2F)

The term location on the x-axis is fixed data sources aregrouped together and colour-coded for faster and more in-tuitive interpretation of the results (Figure 2B) Terms fromthe same GO subtree are located close to each other Thefixed location highlights the term groups from a particularGO subcategory as these terms tend to be enriched togetherforming a peak in the Manhattan plot (Figure 2C) In addi-tion it enables to easily spot the differences in the enrichedterms across several queries (Figure 2D)

Similarly to the previous versions gGOSt provides a re-sult table containing information about the enriched termsoverlap sizes and corresponding P-values With this updatethe results table is made interactive All the data can be fil-tered to highlight or limit the results according to a spe-cific keyword or term size Also columns in the results ta-ble are sortable and their order can be changed by draggingthe column headers In addition columns of limited inter-est can be dropped from the view by dragging them awayfrom the table This allows the user to pick and choose the

columns carrying the most relevant information before ex-tracting the data in publication ready image as PNG and asCSV or GEM files suitable for further analysis

In all the results tables we describe the adjusted P-valuesby an intuitive color-blind friendly viridis colour schemewhich runs from yellow representing insignificant findingsto dark blue with the smallest possible P-values (Figure 2E)

Flexible input

The input gene list can be composed of any commongeneprotein identifiers used by the life science communityin a mixed manner Thanks to gConvert running in thebackground the users can provide gene names mixed withprotein accession numbers probeset IDs chromosomal lo-cations and SNP IDs This provides the users the freedomto analyse their results straight-away without additionalmanual steps of transforming their platform specific iden-tifiers to a particular identifier type to get their first enrich-ment results Since this 2019 version gGOSt also accepts asan input the chromosomal intervals in browser extensibledata (BED) file format This makes it more straightforwardto integrate gGOSt with the annotation based browserslike UCSC Genome Browser and many DNA modificationanalysis tools that output BED files All the chromosomallocations are mapped to the latest Ensembl Genome versionthat gProfiler toolset is built upon Currently we supportthe human genome version GRCh38p12

Multiple queries

gGOSt input can consist of either a single gene list or aset of gene lists Starting from gProfiler 2011 version thiswas possible through a separate gCocoa tool For simplic-ity and better user convenience we merged this functional-ity into gGOSt We automatically evaluate the input typeand launch the enrichment analysis for all the provided genelists

The level of details in the output depends on the numberof query lists In case of a single list the results table includesevidence codes for each term-gene pair As the multi queryoption is useful for comparing multiple enrichment resultsthe table focuses on comparing P-values across those inputlists We highlight the terms that are significant for at leastone of the queries and allow an easy color-coded compari-son to pick up the most significant terms

In the multiple list query view we produce a separateManhattan plot for each of the input lists Despite produc-ing separate plots the resulting images are interlinked Thisallows to highlight the same term across multiple queries byhovering over terms in any of the individual plots

Data sources

We have changed the underlying data source for the miRNAtargets Previously we used Microcosm Targets predictionservice but it is not supported anymore This motivated usto move to experimentally validated information about mi-croRNA targets and now gGOSt includes dedicated inter-action database miRTarBase as a data source (22)

We have introduced with this update WikiPathways (21)as a complementary pathway data source WikiPathways

Dow





Figure 2 Example of gGOSt multiquery Manhattan plot (A) Name of the query (B) X-axis shows the functional terms grouped and colour-coded bydata source (C D) The position of terms in the plots is fixed and terms from the same (GO) branch are close to each other (E) P-values in the tableoutputs are color-coded from yellow (insignificant) to blue (highly significant) (F) Data sources that were not included in the analysis are shown in grey(G) Hovering over the term circle shows a tooltip with the most relevant information (H) In case of a multiquery the same term is also highlighted onother plots (I) The P-values from other queries are indicated in red next to the y-axis for easier comparison (J) A click allows to pin the circles to the plottogether with a numeric ID and creates a more detailed result table below the image

is a community driven initiative that shows consistent andsteady increase in annotated pathways reaching close to3000 pathways for approximately 30 organisms

Since this 2019 version of gGOSt we have enabled usersto use any other relevant data sources by accepting customGene Matrix Transposed file format (GMT files) CustomGMT file support enables users to find enriched terms thatmight be specific for their scientific topic or not regularlyupdated and therefore not selected as default data sourcefor gProfiler This allows us to focus our resources on pro-viding limited but high quality and up-to-date data whileproviding the flexibility of analysing other species or datato the users

Users can either compose GMT files themselves or usepre-compiled gene sets available for example at MolecularSignatures Database (MSigDB) (30) or at httpbaderlaborgGeneSets We also include example GMT files thatmight be useful for our users eg drugndashgene relationships

from PharmGKB (31) on a dedicated section of the Docu-mentation in gProfiler

Technical implementation

This release of gProfiler embodies an extensive update onboth the back and front end The core parts of the toolsethave been revised and rewritten in order to keep up with themodern development trends and to speed up calculationsThe back end is now reimplemented in Python 36 employ-ing the widely used packages such as numpy pandas scipyand statsmodels The biggest change in the back end liesin the data structure for keeping and accessing the anno-tations We replaced the previous BerkeleyDB engine withSQLite database and utilised the Roaring bitmaps (32) datastructure resulting in faster set operations

The REST API is developed in Python 36 using theFlask framework The user interface is mainly based onHTML5 and JavaScript and built with the open-source

Dow





Vuejs framework (httpsvuejsorg) The responsive resulttables are created using ag-Gridjs (httpswwwag-gridcom) library and D3js (httpd3jsorg) library is used forvisualization components

The new implementation is also better modularized mak-ing it easier to maintain and introduce new developments inthe future

Omitted resources

We have removed the gSorter functionality that allowedfinding similarly expressed genes in transcriptomics data asthis functionality is better supported by the Multi Experi-ment Matrix tool dedicated for gene expression similarityanalysis (12)

Due to the licencing changes we stopped providingthe Online Mendelian Inheritance in Man (OMIM) datasource We have also decided to remove the Biogrid networkvisualisation feature as there are better dedicated tools foranalysing and visualising such data eg Cytoscape (33)

USE CASE

To illustrate the several new features in gProfiler we haveused data published by Robert et al (34) They anal-ysed the total mRNA and the mRNA fraction associatedwith polysomes on human adipose tissue-derived stem cells(hASCs) at 24 h of osteogenesis induction Differentially ex-pressed genes (DEGs) were identified by paired compar-isons between non-induced cells and induced cell condi-tions and for polysomal and total RNA fractions GO en-richment analysis of DEGs from both total and polysomalRNA fractions using previous gProfiler version was partof their analysis pipeline The enrichment results were thenvisualized using REVIGO (35) Most of the biological pro-cesses were enriched in both of the groups and related toresponse to external stimuli cell communication and devel-opment

We repeated the enrichment analysis with the same lists ofDEGs in the updated version of gProfiler This use case isa typical example for gGOSt ordered multiquery analysisthat allows to compare the two sets of DEGs ordered bythe FDR value As a result we first see a Manhattan plot forboth of the two queries indicated by the query name (Figure2A)

The details of a term are shown in the tooltip that appearson hover (Figure 2G) The two plots are interlinked so thatthe corresponding term is also highlighted in the other plot(Figure 2H) and the P-value is shown next to the y-axis forbetter comparison (Figure 2I)

User can select circles to pin by clicking on them This at-tributes the circle with an identifier and results in a new rowto the descriptive table below the image showing the IDsnames and P-values for the selected terms (Figure 2J) Sim-ilar but more detailed table appears in the Detailed resultstab

With this use case we want to highlight that users cannow obtain publication ready images that illustrate the sig-nificant terms together with their P-values Our Manhattanplot visualization can serve as an alternative and more in-tuitive option for widely used REVIGO visualizations that

our users often have used to illustrate the findings providedby gProfiler

DISCUSSION

In this 2019 update of gProfiler we have introduced acomplete technical rewrite of our well established serviceThe new version of gProfiler toolset provides faster andmore intuitive service while keeping the technical robust-ness data quality and data update frequency at the samehigh quality level as before Since the initial publicationin 2007 the controlled vocabularies (eg GO) and knowl-edge bases (eg KEGG) have grown considerably in sizeand new data sources continue to emerge (WikiPathwaysDisGeNET) (2136) By including data about more speciesthe underlying data has grown by an order of magnituderequiring us to move towards better suited data structuresThis new architecture also made it possible to enable usersrsquoown GMT files to be used for enrichment analysis a requestthat our users have asked already for a while We have alsoimplemented a new responsive graphical user interface toimprove the user experience The customisability of the re-sults tables and highlighting options of the Manhattan plotallow users to prepare publication and presentation readyimages straight from the website

One of the goals of gProfiler is to continually help scien-tists to conduct their research in a reproducible and trans-parent manner Many web services including for enrich-ment analysis fall into oblivion after the initial releaseTherefore the users need to have a critical mind to recog-nize services that ignore data updates and might miss rel-evant new findings (37) We prevent this by paying a closeattention to data quality and short- and long-term availabil-ity

gProfiler with its fast and versatile components havebeen recognized by European Life Science InfrastructureELIXIR as a Recommended Interoperability Resourceto be of fundamental importance to the research infras-tructure of the life sciences (httpswwwelixir-europeorgplatformsinteroperabilityrirs) Our services allow to elim-inate an annoying and time consuming task majority oflife science researchers still face when trying to use webtools that accept only specific platform identifier types WithgProfiler toolkit we provide the identifier mapping and or-thologous gene conversion services for both the end usersand other service providers The variety of technical plat-forms that we support the interactive web application andthe JSON API the R and Python packages and Galaxytool should allow anyone to find a suitable way to incorpo-rate the interoperability services into their everyday prac-tice

gProfiler has been taught in several ELIXIR memberstates as well as in Canada and USA and plays a centralrole in a recently published pathway enrichment analysisand visualization of omics data protocol (38) gProfiler isalso a major component of an enrichment analysis basedautomatic clustering tool funcExplorer that benefits on pro-grammatic access from the gGOSt service (13)

Future developments of gProfiler will focus on provid-ing biologically more relevant results to the users On onehand it needs analysis and adaptation of alternative gene

Dow





set enrichment methods Since the functional terms are of-ten not independent from each other we need methods thatare capable of taking into account dependencies betweenthe terms and the overall graph structure of the ontologiesSome enrichment tools such as the topGO package in Rpropose methods that take the graph topology of GO intoaccount while testing the enrichment (39) However this isa remaining challenge that needs further research On theother hand the personalized medicine initiatives generateenormous data sets about SNPs and enrichment analysisis moving from gene level to single nucleotide level whenstudying associations with traits Therefore we aim to ex-pand our gSNPense tool to make use of the large datasetsfrom GWAS and eQTL analyses and to put their resultsinto context and highlight the statistically significant rela-tionships from there

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

ACKNOWLEDGEMENTS

We would like to thank our users for feature requests andfeedback that helps us to deliver better service We wouldlike to thank Juri Reimand for the significant contribu-tion to development and advisory roles to past versions ofgProfiler Special thanks go to Dmytro Fishman for creat-ing the graphical abstract We are also grateful to the peo-ple in our research group BIIT for their helpful discussionssuggestions and support

FUNDING

Estonian Research Council grants [IUT34-4 PSG59] Eu-ropean Regional Development Fund for CoE of Esto-nian ICT research EXCITE projects [2014-202040116-0271 ELIXIR] Funding for open access charge EestiTeadusagentuur [IUT34-4]Conflict of interest statement None declared

REFERENCES1 LachmannA RouillardAD MonteiroCD GundersenGW

JagodnikKM JonesMR KuleshovMV McDermottMGFernandezNF DuanQ et al (2016) Enrichr a comprehensive geneset enrichment analysis web server 2016 update Nucleic Acids Res44 W90ndashW97

2 WangJ VasaikarS ShiZ ZhangB and GreerM (2017)WebGestalt 2017 a more comprehensive powerful flexible andinteractive gene set enrichment analysis toolkit Nucleic Acids Res45 W130ndashW137

3 TripathiS PohlMO ZhouY Rodriguez-FrandsenA WangGSteinDA MoultonHM DeJesusP CheJ MulderLC et al(2015) Meta- and orthogonal integration of influenza rdquoOMICsrdquo datadefines a role for UBR4 in virus budding Cell Host Microbe 18723ndash735

4 XieC LiC-Y GaoG HuangJ WuJ KongL DongSMaoX DingY and WeiL (2011) KOBAS 20 a web server forannotation and identification of enriched pathways and diseasesNucleic Acids Res 39 W316ndashW322

5 YanH YouQ TianT YiX LiuY DuZ XuW and SuZ(2017) agriGO v20 a GO analysis toolkit for the agriculturalcommunity 2017 update Nucleic Acids Res 45 W122ndashW129

6 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJT et al(2000) Gene ontology tool for the unification of biology The GeneOntology Consortium Nat Genet 25 25ndash29

7 KohlerS GratianD BaynamG PalmerR DawkinsHSegalM JansenAC BrudnoM ChangWH FreemanAF et al(2018) Expansion of the Human Phenotype Ontology (HPO)knowledge base and resources Nucleic Acids Res 47 D1018ndashD1027

8 ReimandJ ArakT AdlerP KolbergL ReisbergS PetersonHand ViloJ (2016) gProfiler ndash a web server for functionalinterpretation of gene lists (2016 update) Nucleic Acids Res 44W83ndashW89

9 SubramanianA TamayoP MoothaVK MukherjeeSEbertBL GilletteMA PaulovichA PomeroySL GolubTRLanderES et al (2005) Gene set enrichment analysis aknowledge-based approach for interpreting genome-wide expressionprofiles Proc Natl Acad Sci USA 102 15545ndash15550

10 AfganE BakerD BatutB Van Den BeekM BouvierDCechM ChiltonJ ClementsD CoraorN GruningBA et al(2018) The Galaxy platform for accessible reproducible andcollaborative biomedical analyses 2018 update Nucleic Acids Res46 W537ndashW544

11 MetsaluT and ViloJ (2015) ClustVis a web tool for visualizingclustering of multivariate data using Principal Component Analysisand heatmap Nucleic Acids Res 43 W566ndashW570

12 AdlerP KoldeR KullM TkachenkoA PetersonH ReimandJand ViloJ (2009) Mining for coexpression across hundreds ofdatasets using novel rank aggregation and visualization methodsGenome Biol 10 R139

13 KolbergL KuzminI AdlerP ViloJ and PetersonH (2018)funcExplorer a tool for fast data-driven functional characterisationof high-throughput expression data BMC Genomics 19 817

14 KhatriP SirotaM and ButteAJ (2012) Ten years of pathwayanalysis current approaches and outstanding challenges PLoSComput Biol 8 e1002375

15 ReimandJ KullM PetersonH HansenJ and ViloJ (2007)gProfiler ndash a web-based toolset for functional profiling of gene listsfrom large-scale experiments Nucleic Acids Res 35 W193ndashW200

16 ReimandJ ArakT and ViloJ (2011) gProfiler ndash a web server forfunctional interpretation of gene lists (2011 update) Nucleic AcidsRes 39 W307ndashW315

17 FrankishA AbdulSalamAI VulloA ZadissaAWinterbottomA PartonA YatesAD ThormannA ParkerAMcMahonAC et al (2018) Ensembl 2019 Nucleic Acids Res 47D745ndashD751

18 HoweKL BoltBJ ShafieM KerseyP and BerrimanM (2017)WormBase ParaSite a comprehensive resource for helminthgenomics Mol Biochem Parasitol 215 2ndash10

19 MorishimaK TanabeM FurumichiM KanehisaM and SatoY(2018) New approach for understanding genome variations inKEGG Nucleic Acids Res 47 D590ndashD595

20 FabregatA JupeS MatthewsL SidiropoulosK GillespieMGarapatiP HawR JassalB KorningerF MayB et al (2017) TheReactome Pathway Knowledgebase Nucleic Acids Res 46D649ndashD655

21 SlenterDN KutmonM HanspersK RiuttaA WindsorJNunesN MliusJ CirilloE CoortSL DiglesD et al (2017)WikiPathways a multifaceted pathway database bridgingmetabolomics to other omics research Nucleic Acids Res 46D661ndashD667

22 LiangC TuS-J ChengY-N ChouC-H HuangW-CHuangH-D HuangP-C WengS-L HsuW-L TaiC-S et al(2017) miRTarBase update 2018 a resource for experimentallyvalidated microRNA-target interactions Nucleic Acids Res 46D296ndashD302

23 MatysV Kel-MargoulisOV FrickeE LiebichI LandSBarre-DirrieA ReuterI ChekmenevD KrullM HornischerKet al (2006) TRANSFAC and its module TRANSCompeltranscriptional gene regulation in eukaryotes Nucleic Acids Res 34D108ndashD110

24 UhlenM FagerbergL HallstromBM LindskogC OksvoldPMardinogluA SivertssonA KampfC SjostedtE AsplundAet al (2015) Tissue-based map of the human proteome Science 3471260419

Dow





25 BraunerB MontroneC FoboG FrishmanGDunger-KaltenbachI ReinhardJ GiurgiuM and RueppA (2018)CORUM the comprehensive resource of mammalian proteincomplexes2019 Nucleic Acids Res 47 D559ndashD563

26 KinsellaRJ KahariA HaiderS ZamoraJ ProctorGSpudichG Almeida-KingJ StainesD DerwentP KerhornouAet al (2011) Ensembl BioMarts a hub for data retrieval acrosstaxonomic space Database 2011 bar030

27 ConsortiumTU (2018) UniProt a worldwide hub of proteinknowledge Nucleic Acids Res 47 D506ndashD515

28 HuntSE McLarenW GilL ThormannA SchuilenburgHSheppardD PartonA ArmeanIM TrevanionSJ FlicekP et al(2018) Ensembl variation resources Database 2018 bay119

29 CunninghamF MooreB Ruiz-SchultzN RitchieGR andEilbeckK (2015) Improving the Sequence Ontology terminology forgenomic variant annotation J Biomed Semantics 6 32

30 LiberzonA SubramanianA PinchbackR ThorvaldsdttirHTamayoP and MesirovJ (2011) Molecular signatures database(MSigDB) 30 Bioinformatics 27 1739ndash1740

31 Whirl-CarrilloM McDonaghEM HebertJM GongLSangkuhlK ThornCF AltmanRB and KleinTE (2012)Pharmacogenomics knowledge for personalized medicine ClinPharmacol Ther 92 414ndash417

32 LemireD KaiGSY and KaserO (2016) Consistently faster andsmaller compressed bitmaps with Roaring arXiv doihttpsarxivorgabs160306549 21 March 2016 preprint not peerreviewed

33 ShannonP MarkielA OzierO BaligaNS WangJTRamageD AminN SchwikowskiB and IdekerT (2003)Cytoscape a software environment for integrated models ofbiomolecular interaction networks Genome Res 13 2498ndash2504

34 RobertAW AngulskiABB SpangenbergL ShigunovPPereiraIT BettesPSL NayaH CorreaA DallagiovannaB andStimamiglioMA (2018) Gene expression analysis of human adiposetissue-derived stem cells during the initial steps of in vitroosteogenesis Scientific Rep 8 4739

35 SupekF BosnjakM SkuncaN and SmucT (2011) REVIGOsummarizes and visualizes long lists of gene ontology terms PLoSOne 6 e21800

36 Gutirrez-SacristnA BravoA CentenoE SanzF PieroJGarca-GarcaJ Deu-PonsJ Queralt-RosinachN and FurlongLI(2016) DisGeNET a comprehensive platform integrating informationon human disease-associated genes and variants Nucleic Acids Res45 D833ndashD839

37 WadiL MeyerM WeiserJ SteinLD and ReimandJ (2016)Impact of outdated gene annotations on pathway enrichmentanalysis NatMethods 13 705

38 ReimandJ IsserlinR VoisinV KuceraM Tannus-LopesCRostamianfarA WadiL MeyerM WongJ XuC et al (2019)Pathway enrichment analysis and visualization of omics data usinggProfiler GSEA Cytoscape and EnrichmentMap Nat Protoc 14482ndash517

39 AlexaA and RahnenfuhrerJ (2010) topGO enrichment analysis forgene ontology R Package Version 2 2010

Dow





mation of gene lists (WebGestalt GSEA (29)) or use priorknowledge from gene regulation networks (WebGestalt (2))All of these methods have their own limitations and thereare no good benchmark data to assess and compare differ-ent methods (14) In order to serve the users an easy-to-useand fast enrichment tool gProfiler has been focusing onone approach only

Only very few of the tools dedicated to enrichment anal-ysis have offered continuous and up-to-date service aftertheir initial release gProfiler has remained vital since itsfirst publication in the 2007 NAR web server issue and haspublished update articles in 2011 and 2016 (81516) In or-der to continuously support researchers from a variety ofdifferent scientific domains we have increased the numberof supported species and gene identifier types while keep-ing the data update frequency programmable access andcore high quality data sources in steady manner through-out the years (Figure 1) As the complexity and size of theunderlying data has been growing we have now introduceda complete technical rewrite of the gProfiler This allowsus to serve our users faster and more conveniently throughmodern user interface and programming interfaces as wellas opens up new avenues for adding features and maintain-ing the stable service

GPROFILER WEB SERVER

gProfiler (httpsbiitcsuteegprofiler) is a collection oftools that are commonly used in standard pipelines ofbiological entity (geneprotein) centered computationalanalysis gGOSt performs the functional enrichmentanalysis of individual or multiple gene lists gConvertmaps geneprotein identifiers between various namespacesgOrth allows to map orthologous genes across species andgSNPense maps human SNP identifiers to genes

gGOSt ndash functional enrichment analysis

gGOSt is the core tool for performing functional enrich-ment analysis on input gene list It maps a user providedlist of genes to known functional information sources anddetects statistically significantly enriched biological pro-cesses pathways regulatory motifs and protein complexesMain source of information about genes identifier typesand GO terms and associations is the Ensembl database(17) The Gene Ontology is the most comprehensive re-source in gGOSt covering 467 species and strains repre-sented in Ensembl Genomes and fungi plants or metazoaspecific versions of Ensembl (17) and parasite specific datafrom WormBase (18)

We include most common data sources that are also regu-larly updated This covers currently KEGG (19) Reactome(20) and WikiPathways (21) miRNA targets from miRTar-Base (22) and regulatory motif matches from TRANSFAC(23) tissue specificity based on expression data from Hu-man Protein Atlas (24) protein complexes data from CO-RUM (25) and human disease phenotype associations fromHuman Phenotype Ontology (7) Users are free to chooseany combination of the provided data sources This allowsusers for example to focus either on cellular componentsfrom the Gene Ontology or disease information from the

Human Phenotype Ontology For all data sources exceptfor two (KEGG and Transfac due to the licensing reasons)users can also download all the underlying terms and asso-ciations

The functional enrichment of the input gene list is evalu-ated using the well-proven cumulative hypergeometric testFor a single gene list we are testing numerous functionalterms at a time For example already gt16 000 GO biolog-ical process terms are considered for a human gene list Toreduce the amount of false positive findings gGOSt per-forms multiple testing correction By default we use theoriginal gSCS (Set Counts and Sizes) correction methodintroduced back in 2007 (15) Unlike the standard meth-ods gSCS considers the dependency of multiple tests bytaking into account the overlap of functional terms Werepeated the simulations for verifying the gSCS methodand confirmed that the method still holds the same prop-erties Namely gSCS is more conservative than Benjamini-Hochberg False Discovery Rate but not as strict as Bonfer-roni correction Instead of the default gSCS method usercan also choose to apply the Bonferroni correction or theBenjaminindashHochberg False Discovery Rate It is notewor-thy that in gGOSt we report only the adjusted enrichmentP-values

The input gene list can be either unordered or orderedBy default the list is considered unordered and complete Ifthere exists some natural ranking of the genes the orderedgene list option can be used For example the fold changein between different biological conditions the number ofneighboring nodes in a protein-protein interaction networkthe absolute expression values or any other information rel-evant for the experimental setup could be a good basis forsuch ranking In such cases the hypergeometric testing isperformed for each possible prefix of the list starting fromthe first gene and adding next genes one by one The smallestenrichment P-value is reported for each of the terms alongwith corresponding gene list length For different terms thislength can vary especially as the broader terms can be en-riched for larger lists only

The input can also consist of multiple lists simultaneouslyusing a FASTA format like symbol in front of every listThis multi-list analysis feature was previously implementedin the gCocoa tool

By default gGOSt uses the set of all annotated protein-coding genes as a background In some experiments how-ever only a subset of genes or proteins is measured For ex-ample the targeted sequencing of only disease specific geneswould imply enrichment of that disease association Sta-tistically for these cases it is recommended and sometimesnecessary to use custom background information when cal-culating the statistical enrichment significance The custombackground should include a list of genes that were actu-ally measured during the biological experiment such as allgenes in the sequencing panel This option allows to calcu-late a more precise evaluation of functional enrichment

The enrichment results in gGOSt are first highlighted ina novel Manhattan plot It is accompanied by a more exten-sive and interactive results table giving detailed informationabout every gene and term Both of these outputs are cus-tomizable and downloadable in a publication ready format

Dow





0

100

200

300

400

r107

0_e6

4_eg

11

r107

5_e6

5_eg

12

r113

7_e6

8_eg

15

r117

7_e6

8_eg

15

r118

5_e6

9_eg

16

r122

7_e7

2_eg

19

r127

0_e7

5_eg

22

r135

3_e7

8_eg

25

r139

5_e7

9_eg

26

r143

5_e8

0_eg

27

r144

0_e8

1_eg

28

r147

7_e8

2_eg

29

r148

8_e8

3_eg

30

r153

6_e8

3_eg

30

r161

5_e8

4_eg

31

r166

5_e8

5_eg

32

r170

3_e8

6_eg

33

r170

9_e8

7_eg

34

r173

0_e8

8_eg

35

r173

2_e8

9_eg

36

r174

1_e9

0_eg

37

r175

0_e9

1_eg

38

r176

0_e9

3_eg

40

e94_eg41_p11_14a0e05

gProfiler version

Am

ount











Programmatic access


Dow
















Flexible input


Multiple queries




Data sources



Dow













Dow







Omitted resources



USE CASE







DISCUSSION






Dow






SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



























Dow




















Dow





0

100

200

300

400

r107

0_e6

4_eg

11

r107

5_e6

5_eg

12

r113

7_e6

8_eg

15

r117

7_e6

8_eg

15

r118

5_e6

9_eg

16

r122

7_e7

2_eg

19

r127

0_e7

5_eg

22

r135

3_e7

8_eg

25

r139

5_e7

9_eg

26

r143

5_e8

0_eg

27

r144

0_e8

1_eg

28

r147

7_e8

2_eg

29

r148

8_e8

3_eg

30

r153

6_e8

3_eg

30

r161

5_e8

4_eg

31

r166

5_e8

5_eg

32

r170

3_e8

6_eg

33

r170

9_e8

7_eg

34

r173

0_e8

8_eg

35

r173

2_e8

9_eg

36

r174

1_e9

0_eg

37

r175

0_e9

1_eg

38

r176

0_e9

3_eg

40

e94_eg41_p11_14a0e05

gProfiler version

Am

ount











Programmatic access


Dow
















Flexible input


Multiple queries




Data sources



Dow













Dow







Omitted resources



USE CASE







DISCUSSION






Dow






SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



























Dow




















Dow
















Flexible input


Multiple queries




Data sources



Dow













Dow







Omitted resources



USE CASE







DISCUSSION






Dow






SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



























Dow




















Dow













Dow







Omitted resources



USE CASE







DISCUSSION






Dow






SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



























Dow




















Dow







Omitted resources



USE CASE







DISCUSSION






Dow






SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



























Dow




















Dow






SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



























Dow




















Dow




















Dow




Documents

BIIT - g:Profiler: a web server for functional enrichment analysis … · 2020. 10. 22. · 4 NucleicAcidsResearch,2019 PythonpackagesutilizethenewJSONAPIsthatnowoffer amorestandardizedwaytoaccessg:Profilerprogrammati-cally.Moreover