RAPYD — Rapid Annotation Platform for Yeast Data


Journal of Biotechnology 155 (2011) 118– 126


Jessica Schneider a,b, Jochen Blom a, Sebastian Jaenicke a, Burkhard Linke a, Karina Brinkrolf b, Heiko Neuweger a, Andreas Tauch b, Alexander Goesmann a,∗

a Computational Genomics, Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
b Systems Biology of Regulatory Networks, Institute for Genome Research and Systems Biology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany

Article info

Article history: Received 17 August 2010; Received in revised form 13 October 2010; Accepted 22 October 2010; Available online 30 October 2010

Keywords: Yeasts; Genome annotation; Metabolic pathways; Comparative genomics; Meyerozyma guilliermondii

Abstract

Lower eukaryotes of the kingdom Fungi include a variety of biotechnologically important yeast species that have been in the focus of genome research for more than a decade. Due to the rapid progress in ultra-fast sequencing technologies, the amount of available yeast genome data increases steadily. Thus, an efficient bioinformatics platform is required that covers genome assembly, eukaryotic gene prediction, genome annotation, comparative yeast genomics, and metabolic pathway reconstruction. Here, we present a bioinformatics platform for yeast genomics named RAPYD addressing the key requirements of extensive yeast sequence data analysis. The first step is a comprehensive regional and functional annotation of a yeast genome. A region prediction pipeline was implemented to obtain reliable and high-quality predictions of coding sequences and further genome features. Functions of coding sequences are automatically determined using a configurable prediction pipeline. Based on the resulting functional annotations, a metabolic pathway reconstruction module can be utilized to rapidly generate an overview of organism-specific features and metabolic blueprints. In a final analysis step, shared and divergent features of closely related yeast strains can be explored using the comparative genomics module. An in-depth application example of the yeast Meyerozyma guilliermondii illustrates the functionality of RAPYD. A user-friendly web interface is available at https://rapyd.cebitec.uni-bielefeld.de.

1. Introduction

Yeasts are classified in the kingdom Fungi. They are essential for a variety of biotechnological applications like the production of feed additives to increase the nutrient value (Kaur et al., 2010) or the improvement of drink and food fermentation, for instance of wine (Rossouw et al., 2010) or bread (Yeh et al., 2009). Furthermore, yeast strains are used to facilitate the synthesis of products with industrial relevance like ethanol (Suwannarangsee et al., 2010; Tomas-Pejo et al., 2010) and proteins of pharmaceutical interest to support human immunity (Conesa et al., 2010; Wang et al., 2009). In addition, the understanding of yeast pathogenicity plays an important role as some species affect human health, especially in immunocompromised patients (Satyanarayana and Kunze, 2009).
∗ Corresponding author: A. Goesmann.

0168-1656/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jbiotec.2010.10.076


The analysis of yeast-specific genome features based on ultra-fast sequencing techniques represents the state-of-the-art approach for the systematic genetic investigation of yeast strains. Three major tasks have to be addressed using bioinformatics approaches: (a) the genome has to be annotated subsequent to the initial assembly of raw sequence data obtained from high-throughput sequencing, (b) metabolic pathway reconstruction has to be performed to get insights into metabolic capabilities, lifestyle, or pathogenicity of an organism, and (c) comparative genomics approaches have to be conducted to produce sets of shared or divergent genes of a selected set of organisms. These major aspects for the interpretation of yeast genome data will be discussed in the following.

The initial step in a sequence analysis pipeline is the genome annotation. This comprises two steps: first, the identification of regions of interest like coding sequences, rRNAs, and tRNAs; second, the assignment of a detailed function description for each detected region. Genome analysis and data interpretation pipelines for bacteria have been available for more than a decade and are today successfully used in many projects. Available genome annotation systems that automate this process are, for instance, GenDB (Meyer et al., 2003), AGMIAL (Bryson et al., 2006) and GAMOLA (Altermann and Klaenhammer, 2003). They are restricted to prokaryotic genomes and cannot handle coding sequences

with intron and exon structures or multiple contigs representing chromosomes of the same genome. Hence, dedicated annotation systems for yeasts and higher eukaryotes were developed. Examples for eukaryotic genome annotation systems are PEDANT (Walter et al., 2009), Ensembl (Fernandez-Suarez and Schuster, 2010) and MIPS (Mewes et al., 2008). These systems are connected to related genome databases and enable automatic analyses and genome annotations.

Following the regional and functional annotation, one next step for an in-depth analysis of genomic data is the reconstruction of metabolic pathways. To support this widely used approach for fungi, various tools were developed, such as FUNGIpath, which predicts fungal metabolic pathways by orthology (Grossetete et al., 2010). Moreover, biochemical pathway analysis is integrated in the Candida Genome Database (Skrzypek et al., 2010) as well as the Saccharomyces Genome Database (Dwight et al., 2002), both also providing organism-specific gene annotations. These software tools are applicable for published genomes only and the corresponding pathway maps are not generated in the Systems Biology Markup Language (SBML), which is nowadays a common standard file format in systems biology (Hucka et al., 2003). Such a metabolic reconstruction can be created by various tools such as KEGG2SBML (Funahashi et al., 2004), the KEGGConverter (Moutselos et al., 2009) and CARMEN (Schneider et al., 2010). The major information source for their reconstruction processes is KEGG, the Kyoto Encyclopedia of Genes and Genomes (Ogata et al., 1999). Another universal metabolic pathway database is MetaCyc, which contains experimentally verified metabolic pathways and enzyme information (Krieger et al., 2004). The derived BioCyc database is a collection of more than 500 organism-specific Pathway/Genome Databases (PGDBs) (Caspi et al., 2010). Each PGDB contains metabolic networks of one organism predicted by the Pathway Tools software using MetaCyc as a reference database. One example is YeastCyc, which was created for Saccharomyces cerevisiae. Pathway Tools enables the user to establish local versions of metabolic pathway databases with the focus on a single genome of interest. Thus, the possibilities for a detailed comparison of various (draft) genomes within such a project are limited.

Due to the dramatic decrease of sequencing costs it is feasible to sequence not only single genomes, but also sets of closely related strains. As a corollary, there is an increased demand for comparative software tools and databases. Generally, tools like VISTA (Frazer et al., 2004), xBASE (Chaudhuri et al., 2008) or GeCont (Martinez-Guerrero et al., 2008) can be used for microbes as well as comparative databases like CMR (Peterson et al., 2001) or MBGD (Uchiyama, 2003). In the context of comparative yeast genomics a recent example is the analysis of eight sequenced pathogenic Candida yeasts to get insights into the evolution of pathogenicity and sexual reproduction (Butler et al., 2009). Bioinformatics tools for comparative data interpretation are CFGP (Park et al., 2008), SNUGB (Jung et al., 2008), and e-Fungi (Hedeler et al., 2007). The generally applicable tool EDGAR (Blom et al., 2009) was initially implemented for bacterial genomes. EDGAR facilitates the computation of genomic subsets like the core genome, singletons and the pan-genome. Additionally, it provides a comparative view on the genomic neighborhood of orthologous genes as well as various visualization options.

As described above, a variety of individual annotation and analysis tools tailored towards yeast genomics are available. However, the combination and integration of results from genome annotation, comparative genomics and metabolic pathway reconstruction is often cumbersome and requires substantial manual effort. Thus, in this work a novel platform named RAPYD (Rapid Annotation Platform for Yeast Data) is described connecting universal genome analysis and interpretation methods. The platform supports the understanding of yeast-specific features, lifestyle, and metabolic blueprints by applying and combining regional, functional, and comparative genomics as well as metabolic pathway reconstructions.

2. Methods and implementation

For efficient handling and bioinformatic interpretation of yeast genome data, the RAPYD uses three major software modules for yeast annotation, comparative genomics, and metabolic pathway reconstruction. A user-friendly web interface of the RAPYD was established at https://rapyd.cebitec.uni-bielefeld.de.

2.1. System design and technical details

The RAPYD is based on a three-tier architecture that embeds the database layer, the business logic layer, and the presentation layer being realized as a web frontend (Fig. 1a). The database layer provides an object-relational data model represented in an underlying MySQL database. The business logic of the RAPYD modules provides structured access to the modeled objects. Access control is handled by the General Project Management System (GPMS) that is widely used for several CeBiTec (Center for Biotechnology) software applications. GPMS supports the management of multiple projects and provides a role-based authorization model. Especially with regard to unpublished genomes, it is crucial to control access to the genomic data. For this purpose, RAPYD requires authentication and authorization by user name and password. To enable researchers to test the RAPYD, a demonstration project is available. Individual projects can be set up upon request.

The web interface of RAPYD is based on CGI scripts running on an Apache server. It provides a central entry page controlling access to genomic data and the functionality of the three integrated and interlinked RAPYD modules for yeast annotation, comparative genomics, and metabolic pathway reconstruction (Fig. 1b). As a starting point, assembled contigs or scaffolds can be imported into the yeast annotation module, which also features a region and a function prediction pipeline for accurate genome annotation. Connections to the comparative genomics module and the metabolic reconstruction module are provided by the project management system. Each module enables the export of generated data in various formats: FASTA sequences can be downloaded for annotated regions and EMBL or GBK files can be generated to store region and function predictions for one organism. The comparative genomics module for yeasts enables the export of gene sets computed among different organisms or species in diverse output formats. Finally, the metabolic reconstruction module is able to use genome annotation data or comparative genomics data to generate representations of metabolic pathways that can be exported as SVG images or standardized SBML files.

2.2. RAPYD yeast annotation module

The main component of the RAPYD is derived from the prokaryotic genome annotation framework GenDB (Meyer et al., 2003). This system is based on an object-oriented data model and relies on a relational database. Initially, it was developed for microbial genomes. To establish a component for yeast genomes with functions similar to GenDB, various extensions and adaptations were implemented. The RAPYD yeast annotation module enables the storage of eukaryotic features such as intron and exon structures and it can now handle multiple contigs/chromosomes that belong to the same genome (strain). These backend adaptations were also incorporated into the web frontend visualizing eukaryotic gene and genome features (Fig. 2a and b).

Fig. 1. System design of the RAPYD. (a) Three-tier architecture of the RAPYD covering the web frontend, business logic and data backend. (b) Combination of the RAPYD core modules and their functionality. The annotation module comprises a region and a function prediction pipeline. Based on these results, the comparative genomics module can compute various gene sets and visualization methods. The metabolic pathway reconstruction uses annotated EC numbers of the annotation module to create user-selected metabolic pathways stored in SVG or SBML format.

The annotation process of a draft genome starts with a sequence import into the platform. Depending on the number of assembled contigs or scaffolds, it can be useful to connect single sequence fragments by a 6-frame-stop-linker (e.g. CTAGCTAGCTAG) prior to database import. This procedure avoids a prediction of genes spanning gaps between scaffold or contig ends. For the interpretation of genomic data, a regional and a functional analysis pipeline were implemented within RAPYD. The new region prediction pipeline includes a default gene prediction by Augustus (Stanke et al., 2006; Stanke and Waack, 2003) using a provided training set of the model organism S. cerevisiae. Moreover, several other gene prediction tools were integrated to allow for customized gene prediction strategies, e.g. Genemark (Ter-Hovhannisyan et al., 2008), GeneID (Parra et al., 2000) and GlimmerM (Salzberg et al., 1999). Another important feature of this pipeline is the prediction of tRNA and rRNA genes. This was realized by the integration of the software tools ARAGORN (Laslett and Canback, 2004) and RNAMMER (Lagesen et al., 2007).

After the region prediction step, a novel functional prediction pipeline termed Metanor-Euk is executed. The pipeline individually designed for RAPYD is based on observations computed for coding sequences and genes by several customized sequence analysis tools (Fig. 2c). The pipeline includes BLAST comparisons to continuously updated databases such as the NCBI (nt and nr), Swissprot (Boutet et al., 2007), KEGG (Ogata et al., 1999), KOG (Tatusov et al., 2003), and the conserved domain databases (CDD) (Marchler-Bauer et al., 2009). Hidden Markov model based sequence analysis is performed via the tool hmmpfam against the PFAM (Finn et al., 2010) and the TIGRFAMS databases (Haft et al., 2003). In addition, the integrated resource of protein domains and functional sites InterPro (Hunter et al., 2009) is queried. Signal peptides, transmembrane helices, and helix–turn–helix motifs in proteins are predicted using the tools SignalP (Bendtsen et al., 2004), TMHMM (Chaudhuri et al., 2008), and helix–turn–helix (Rice et al., 2000), respectively. The actual computation of these tools is performed on a compute cluster and all relevant results are stored as observation objects in the object-relational database of the RAPYD annotation module. Finally, the Metanor-Euk annotation module combines and filters these results using a rule-based approach. For this purpose, confidence levels have been defined for each tool used in the Metanor-Euk pipeline. For example, the manually curated SwissProt database is assigned the highest confidence level among the BLAST tools. In the automated annotation process, the results with the highest confidence level containing a gene product, name and EC number are used for the functional annotation of a coding sequence. In general, a conservative E-value threshold of 1e−30 is applied and genes without reliable BLAST hits to known databases will be annotated as ‘hypothetical proteins’. Hypothetical proteins with SignalP or TMHMM predictions will automatically be named ‘putative secreted protein’ or ‘putative membrane protein’, respectively. To improve the annotation of closely related organisms, individual gene predictions and annotations can be generated by mapping existing reference data on a genome of interest.
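The rule-based selection described for Metanor-Euk can be illustrated with a minimal Python sketch. It is not the RAPYD implementation: the `Observation` class, the numeric confidence ranks, and the precedence of the SignalP rule over the TMHMM rule are assumptions made for illustration. Only the E-value threshold of 1e−30, the top rank of SwissProt, and the three fallback names are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical confidence ranks (higher = more trusted). The text only states
# that SwissProt receives the highest confidence among the BLAST databases.
CONFIDENCE = {"swissprot": 3, "kegg": 2, "nr": 1}
E_VALUE_THRESHOLD = 1e-30  # conservative cutoff quoted in the text


@dataclass
class Observation:
    """One tool result for a coding sequence (illustrative structure)."""
    tool: str
    e_value: float
    product: str
    gene_name: Optional[str] = None
    ec_number: Optional[str] = None


def annotate(observations, has_signal_peptide=False, has_tm_helix=False):
    """Return a product description for one coding sequence."""
    # Keep only hits from ranked databases that pass the E-value threshold.
    reliable = [o for o in observations
                if o.tool in CONFIDENCE and o.e_value <= E_VALUE_THRESHOLD]
    if reliable:
        # Prefer the highest confidence rank, then the lowest E-value.
        best = max(reliable, key=lambda o: (CONFIDENCE[o.tool], -o.e_value))
        return best.product
    # No reliable hit: fall back to the naming rules for hypothetical proteins.
    # Checking the signal peptide first is an assumption of this sketch.
    if has_signal_peptide:
        return "putative secreted protein"
    if has_tm_helix:
        return "putative membrane protein"
    return "hypothetical protein"
```

In this sketch a SwissProt hit wins over an nr hit even when the nr hit has a lower E-value, which mirrors the idea that database confidence, not raw score, drives the automated annotation.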

A second field of application is the import and re-annotation of already published genomes for further analysis. Besides the improvement of regional or functional predictions (re-annotation), metabolic pathway reconstructions and comparative analyses can be performed. In this way, a collection of yeast genomes within a RAPYD project can be utilized for comparative analyses like the computation of the core or pan-genome, as well as for the detection of singletons.
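Returning to the sequence import step described earlier in this section, the effect of the 6-frame-stop-linker can be checked with a short Python sketch. The function names are hypothetical and the code is only an illustration of the idea, not part of RAPYD. It joins sequence fragments with the example linker CTAGCTAGCTAG and verifies that the linker contains a stop codon in all three reading frames of both strands.

```python
STOP_LINKER = "CTAGCTAGCTAG"  # example linker quoted in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_stop_in_all_frames(linker):
    """True if every one of the six reading frames contains a stop codon."""
    for strand in (linker, reverse_complement(linker)):
        for frame in range(3):
            codons = {strand[i:i + 3]
                      for i in range(frame, len(strand) - 2, 3)}
            if not codons & STOP_CODONS:
                return False
    return True


def join_contigs(contigs, linker=STOP_LINKER):
    """Concatenate sequence fragments, separated by the stop linker."""
    return linker.join(contigs)
```

Because CTAG is its own reverse complement, the repeated motif reads identically on both strands, so a TAG stop codon appears in every frame and no predicted gene can run across a joined gap.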

2.3. RAPYD metabolic pathway reconstruction module

For metabolic reconstruction of a selected organism within a RAPYD project, a new module using extended functionality of the CARMEN software (Schneider et al., 2010) was integrated. This module enables researchers to reconstruct metabolic networks by gathering annotation information of all associated chromosomes of an organism of interest. Annotated EC numbers, gene identifiers, and gene names predicted by the annotation module can be combined and mapped onto a collection of pathway maps including computerized information about graphical objects and their relations stored in the KEGG markup language (KGML) provided by the KEGG database (Ogata et al., 1999). The reconstruction and visualization of distinct pathway maps is used to obtain a rapid overview of the metabolic capabilities of an organism. It also identifies metabolic pathways that are not completely covered by corresponding genes, revealing the inability to synthesize or to degrade certain metabolites. After finishing the reconstruction process, a preview image of the generated network is provided as an SVG image; the actual pathway model can be downloaded in SBML format (Levels 2.1, 2.4, and 3.1). SBML enables the storage of additional information that cannot be fully displayed in a static picture, like gene identifiers, EC numbers, or KEGG reaction identifiers. This information can directly be exploited via a database connection within an SBML editor, for example by using the CellDesigner software (Kitano et al., 2005). The pathway reconstruction module of RAPYD therefore provides a fast and efficient way of visualizing and analyzing the metabolic features of an organism of interest. Furthermore, the generated SBML files offer automated access to modeling and simulation tools such as the CellNet Analyser (Klamt et al., 2007). The functionality of the metabolic pathway reconstruction module is available via the corresponding menu item at the top of the main annotation window (see Fig. 2).

Fig. 2. Main page of the genome annotation module displaying the contig view and contig as well as region information. (a) Chromosomes can be selected via the contig menu. (b) Information about predicted exon and intron structures can be visualized using the action button ‘Show exon intron structure’. (c) The ‘Job information’ window displays pipeline-associated tools, the number of related jobs and their execution status for a selected chromosome.

2.4. RAPYD comparative genomics module

The comparative genomics module of RAPYD is based on the EDGAR (Blom et al., 2009) platform for bacterial comparative genomics. EDGAR was initially designed for the comparison of simple prokaryotes with a single chromosome, thus lacking support for multi-replicon organisms. Within the RAPYD framework, an extension of the data model by a supercontig class was implemented that maintains the association of chromosomes to a source organism. This allows for the calculation of all comparisons on single replicon level as well as on complete genomes with multiple chromosomes. An all-against-all genome BLAST comparison is executed automatically to obtain orthologous genes of the integrated organisms. Within the RAPYD, the main purpose of the comparative genomics module is the calculation of genomic subsets like core genome, pan-genome, and singleton genes. Furthermore, it includes several visualization features like Venn diagrams or synteny plots. In addition, a synchronization of the data models of the annotation module and the comparative module was realized. This allows for the exchange of results directly between both components. For instance, the user can select a consensus annotation for a set of orthologous genes that will be assigned to all of these genes in the annotation platform. This simplifies the annotation process for strains with a closely related reference genome that was already sequenced and annotated. All results can be downloaded in various formats such as FASTA or tab-separated lists. The comparative genomics module can be executed using the corresponding menu item of the main annotation window resulting in a selection list of available comparative analyses (see Fig. 2). In addition, reconstructed metabolic pathways can be enriched with additional information calculated by the comparative genomics module. Genomic subsets like core genome or singleton genes can be computed by the corresponding calculation routines of the comparative module and visualized in their metabolic context.
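The mapping of annotated EC numbers onto a pathway map, enriched with core genome information, can be sketched as follows. This is a simplified illustration and not the CARMEN or RAPYD code: the function name, the plain-dictionary input, and the three status labels are assumptions, and the gene identifiers in the usage note are placeholders.

```python
def pathway_coverage(pathway_ecs, annotated, core_genes=frozenset()):
    """Classify each EC number of a pathway map.

    pathway_ecs: iterable of EC numbers that occur on the pathway map.
    annotated:   dict mapping gene identifier -> annotated EC number.
    core_genes:  gene identifiers belonging to the core genome.

    Returns a dict mapping EC number -> 'missing', 'present' or 'core'.
    """
    # Invert the annotation: collect all genes assigned to each EC number.
    genes_by_ec = {}
    for gene, ec in annotated.items():
        genes_by_ec.setdefault(ec, set()).add(gene)

    core = set(core_genes)
    status = {}
    for ec in pathway_ecs:
        genes = genes_by_ec.get(ec, set())
        if not genes:
            status[ec] = "missing"   # no gene covers this reaction
        elif genes & core:
            status[ec] = "core"      # covered by a core genome gene
        else:
            status[ec] = "present"   # covered, but not conserved in all genomes
    return status
```

For example, calling `pathway_coverage(["5.3.1.1", "2.7.1.11", "1.1.1.9"], {"g1": "5.3.1.1", "g2": "2.7.1.11"}, core_genes={"g1"})` marks the first EC number as core, the second as present, and the third as missing. A renderer could then color the corresponding map objects accordingly.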

3. Results and discussion

We have developed a novel specialized bioinformatics platform called RAPYD for the analysis of yeast genomic data that can be applied for several approaches on yeast genomes within one day. With the efficiency of high-throughput sequencing, this platform can first of all be used for a de novo annotation and subsequent interpretation of unfinished (draft) genomes. This application is of special interest, since finishing of eukaryotic genomes is time and cost consuming. Second, this platform can be utilized for the improvement of annotation data, metabolic pathway reconstruction, and comparative analysis of already published and finished genomes. The automatic annotation process provided by RAPYD enables the rapid update of functional annotations with regard to increasing numbers of eukaryotic sequencing projects and database extensions. A detailed workflow to pass through the annotation process of both draft and finished genomes is described in Section 2.3.

To demonstrate the usefulness of the RAPYD, a publicly accessible project was set up including genomic data of Pichia pastoris GS 115 (De Schutter et al., 2009) and Pichia stipitis (synonym Scheffersomyces stipitis) CBS 6054 (Jeffries et al., 2007). In addition, the recently sequenced and so far unfinished draft genome sequence of Candida guilliermondii ATCC 6260 (Butler et al., 2009) was integrated to demonstrate the analysis capabilities of RAPYD. This species was recently renamed Meyerozyma guilliermondii (Kurtzman and Suzuki, 2010), a designation used throughout this manuscript. Several M. guilliermondii strains are of biotechnological interest as they produce riboflavin or efficiently convert xylose to the anti-caries sweetener xylitol (Zou et al., 2010). Besides biotechnologically relevant features, other M. guilliermondii strains also have pathogenic potential (De Vos et al., 2005). Focusing on the draft genome of M. guilliermondii, we applied RAPYD for detailed data interpretation to get insights into lifestyle and metabolic capabilities of this yeast. As an application example, the central carbohydrate metabolism pathway of glycolysis was reconstructed. Furthermore, comparative genomics of M. guilliermondii were performed based on the published genomes of the biotechnologically important strains P. pastoris and P. stipitis.

3.1. Functional genome annotation for M. guilliermondii using the RAPYD annotation module

To demonstrate the use of RAPYD on draft genomes, the available genome assembly of M. guilliermondii consisting of 71 contigs in 8 scaffolds was used (accession numbers CH408155–CH408162, genome project AAFM00000000). Based on the scaffolds, coding sequences were already predicted, with the majority of CDS classified as hypothetical proteins (Butler et al., 2009). These data were imported into the RAPYD including the region predictions for 5915 coding sequences and 120 tRNAs, both displayed on the contig view.

To improve the reference gene prediction within the RAPYD project, additional ab initio gene finders such as Genemark (Ter-Hovhannisyan et al., 2008) and Augustus (Stanke et al., 2006; Stanke and Waack, 2003) were executed to get further hints for reliable gene structures. Genemark is a self-training algorithm that predicted 5208 genes across the imported scaffolds. Augustus was executed using the pre-computed training data set ‘Candida guilliermondii’ and led to 5667 predicted genes. Results of both tools are visualized for each gene on the contig view as white colored bars representing observations that can provide additional information on the correct gene structures (Fig. 3).

The intersection of genes annotated in M. guilliermondii and genes identically predicted by Augustus comprises 4392 genes (74.2%), while Genemark predicts only 3199 genes (54.0%) that are in exact accordance with the reference annotation. Allowing start and stop variations of ±100 bases, the number of genes overlapping with the reference annotations increases to 4752 genes (80.2%) and 3788 genes (64.0%) for Augustus and Genemark, respectively. Data discrepancies can be used as starting points for further manual analysis. For example, the gene finders normally do not predict overlapping genes, but the reference annotation contains in total 308 overlapping genes (Fig. 3a). In addition, gene finder predictions exist without annotated reference genes (Fig. 3b) and some reference genes are not discovered by the gene finders (Fig. 3c). Genemark in particular tends to predict longer genes, which are separated into two or more genes within the reference genome (Fig. 3d). These findings can be used for further investigations to improve the gene prediction. This process might be automated by developing a combined gene prediction strategy for specific yeasts, informed by manual curation experience. Such an implementation can easily be integrated into the RAPYD framework to support regional predictions.
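The comparison between reference genes and gene finder output can be sketched in Python as below. The tuple layout `(contig, start, stop, strand)` and the function name are assumptions for illustration, and the matching rule (start and stop each within the tolerance) is one plausible reading of the ±100 base criterion rather than the exact procedure used for the numbers above.

```python
def count_matches(reference, predicted, tolerance=0):
    """Count reference genes that are reproduced by a gene finder.

    Each gene is a tuple (contig, start, stop, strand). A reference gene
    counts as matched if some prediction lies on the same contig and strand
    and its start and stop each deviate by at most `tolerance` bases.
    """
    matched = 0
    for contig, start, stop, strand in reference:
        if any(p_contig == contig and p_strand == strand
               and abs(p_start - start) <= tolerance
               and abs(p_stop - stop) <= tolerance
               for p_contig, p_start, p_stop, p_strand in predicted):
            matched += 1
    return matched


# Toy data, not taken from the M. guilliermondii annotation.
reference = [("s1", 100, 400, "+"), ("s1", 1000, 1600, "-"), ("s2", 50, 350, "+")]
predicted = [("s1", 100, 400, "+"), ("s1", 1040, 1600, "-"), ("s2", 500, 900, "+")]

exact = count_matches(reference, predicted)            # identical coordinates
relaxed = count_matches(reference, predicted, 100)     # +/- 100 bases allowed
```

With the toy data, one reference gene is matched exactly and two are matched once a deviation of 100 bases is allowed. The third reference gene has no counterpart, which corresponds to the scenario of reference genes missed by the gene finders.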

A functional annotation was missing so far for the majority of the 5915 imported M. guilliermondii genes. Twenty-two of these genes were imported with regional status ‘attention needed’ because they lack start or stop positions due to the current draft status of the genome assembly. The functional prediction pipeline of RAPYD was executed based on the imported gene locations to obtain gene functions, gene names and EC numbers. This procedure resulted in 4150 genes with certain functional descriptions (excluding 427 uncharacterized proteins). Only 1071 genes were annotated as hypothetical proteins, 99 genes as putative membrane proteins, and 168 genes as putative secreted peptides. In addition, 2530 gene names and 2053 EC numbers were annotated.

3.2. Reconstruction of the glycolysis pathway of M. guilliermondii

The functional annotation of M. guilliermondii genes (see Section 3.1) serves as a basis for a metabolic pathway reconstruction. For this yeast, 2053 EC numbers were automatically annotated and used by the metabolic reconstruction module for an automated analysis of diverse metabolic pathway maps provided by KEGG (Ogata et al., 1999). One of these reconstructions was conducted for the universally present metabolic pathway of glycolysis (Fig. 4a). During pathway reconstruction, genome-wide assigned EC numbers and associated genes were mapped onto the selected KEGG map (KEGG map 00010). This provides a considerable advantage in pathway annotation and analyses, especially when the respective genes are widely spread throughout the genome. In case of the glycolysis of M. guilliermondii, only two putative enzymes, a triosephosphate isomerase (TPI, PGUG 01405) and a phosphoglyceromutase (PGUG 01406), are encoded by adjacent genes. The remaining genes are randomly distributed throughout the genomic scaffolds. Even the phosphofructokinase subunits α PFK1 (PGUG 03026, CH408157) and β PFK2 (PGUG 4679, CH408160) are located on different scaffolds. After a pathway reconstruction run, a pathway preview is shown and various output formats can be downloaded (Fig. 4b). The reconstructed glycolysis was exported in standardized SBML format that was manually curated and visualized using the SBML editor CellDesigner (Kitano et al., 2005) to obtain a clear overview of the metabolic capabilities of M. guilliermondii within this metabolic pathway (Fig. 4c).

Fig. 3. Occurring scenarios for the prediction of coding sequences in the reference genome M. guilliermondii. (a) Overlapping genes based on the M. guilliermondii Genbank import. (b) Genes predicted by Augustus and Genemark that are not annotated in the reference genome. (c) Reference genes not predicted by Augustus and Genemark. (d) Gene predictions spanning several reference genes.

Fig. 4. Reconstruction of the glycolysis pathway of M. guilliermondii using the RAPYD. The link ‘Pathway reconstruction’ opens a new window displaying the reconstruction start page. (a) After pathway selection the reconstruction process can be executed resulting in (b) the creation of various SBML output formats and a pathway preview based on a generated SVG image. (c) The SBML Level 2 Version 1 is adjusted to the CellDesigner software and was used for manual curation.

3.3. Comparative genomics of yeasts

To gain insights into organism-specific features of M. guilliermondii ATCC 6260, a comparative genome analysis was performed.

124 J. Schneider et al. / Journal of Biotechnology 155 (2011) 118– 126

Fig. 5. The comparative yeast genomics module enables the computation of shared and divergent gene sets for P. pastoris, P. stipitis and M. guilliermondii and provides various visualization features. (a) Screenshot of the core genome computation. (b) Venn diagram of P. pastoris, P. stipitis and M. guilliermondii genes. (c) Reconstruction and manual curation of the pentose phosphate pathway of M. guilliermondii. (d) Mapping of core genome genes of all three yeasts computed by the comparative genomics module onto the reference pathway of M. guilliermondii. The majority of genes are part of the core genome, but the branch for xylitol (C00379) production starting at D-ribulose 5-phosphate (C00199) is not conserved and lacking in P. pastoris. Missing genes and associated metabolites are colored grey. Core genome genes are highlighted orange. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Therefore, the complete genome sequences of P. pastoris and P. stipitis were imported into RAPYD. The individual options of this RAPYD module can be accessed via the menu entry ‘Comparative genomics’ in the RAPYD main window.

Computational results of this analysis are displayed in a Venn diagram, giving (i) the singletons of each organism, (ii) the homologous genes of two species each, as well as (iii) the number of 1810 genes representing the core genome of all three species. This comparison clearly shows the close relationship between M. guilliermondii and P. stipitis with 1367 homologous genes, while M. guilliermondii and P. pastoris only have 218 genes in common (Fig. 5b).

Similarities within the gene repertoire of M. guilliermondii and P. stipitis can also be linked to the metabolic pathways of the analyzed organisms. In this regard, one example is the production of xylitol. While the catalyzing genes are shared by P. stipitis and M. guilliermondii, the core genome, which also considers P. pastoris, does not include these genes. Mapping this core genome to a metabolic reference pathway of M. guilliermondii additionally visualized this finding. Using the metabolic reconstruction module of RAPYD, the pentose phosphate pathway (KEGG map 00030) was generated and subsequently curated and extended by the xylitol synthesis pathway (origin KEGG map 00040) using the CellDesigner software (Fig. 5c). Missing genes, e.g. the xylitol dehydrogenase, and associated metabolites are colored in grey in contrast to the orange highlighted core genome genes (Fig. 5d). Thus, M. guilliermondii and P. stipitis are suggested to be able to synthesize xylitol, while P. pastoris is lacking this metabolic ability. This finding is consistent with the successful cloning and expression of a xylose reductase gene of M. guilliermondii in P. pastoris (Handumrongkul et al., 1998).
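The set logic behind the Venn diagram and the pathway coloring described above can be illustrated with a minimal Python sketch. It is not the RAPYD implementation: it assumes that homologous genes have already been clustered into groups by some upstream method, and the group identifiers, the toy data, and the function names `venn_regions`, `core_genome` and `colour_pathway` are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Set


def venn_regions(groups: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """Count homolog groups per exact species combination (one Venn region each)."""
    counts: Dict[FrozenSet[str], int] = {}
    for species in groups.values():
        region = frozenset(species)
        counts[region] = counts.get(region, 0) + 1
    return counts


def core_genome(groups: Dict[str, Set[str]], all_species: Set[str]) -> Set[str]:
    """Return the groups that have a member gene in every species."""
    return {gid for gid, species in groups.items() if all_species <= species}


def colour_pathway(pathway_genes: Iterable[str], core: Set[str]) -> Dict[str, str]:
    """Colour core genome genes orange and all other pathway genes grey."""
    return {gene: ("orange" if gene in core else "grey") for gene in pathway_genes}


# Toy data (assumption): group id -> species in which a member gene was found.
SPECIES = {"M. guilliermondii", "P. stipitis", "P. pastoris"}
groups = {
    "grp1": {"M. guilliermondii", "P. stipitis", "P. pastoris"},  # core genome
    "grp2": {"M. guilliermondii", "P. stipitis"},                 # shared by two species
    "grp3": {"M. guilliermondii"},                                # singleton
    "grp4": {"P. pastoris"},                                      # singleton
    "grp5": {"M. guilliermondii", "P. stipitis", "P. pastoris"},  # core genome
}

venn = venn_regions(groups)
core = core_genome(groups, SPECIES)
colours = colour_pathway(["grp1", "grp2"], core)

for region, n in venn.items():
    print(sorted(region), n)
print(colours)
```

In this sketch a group shared only by M. guilliermondii and P. stipitis falls outside the core genome and is therefore colored grey on the pathway map, which mirrors the situation described for the xylitol branch.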

4. Conclusions

With RAPYD, we developed a fast and easy-to-use software platform that enables in-depth annotation of lower eukaryotes. Furthermore, RAPYD provides fully integrated components for comparative genomics and metabolic pathway reconstruction. A major advantage of RAPYD is its use not only on finished genomes, but also on draft genomes after automated assembly. The modular software design is well suited for yeast genome data and offers flexibility as well as extensibility. For instance, to overcome the problem of poorly conceived generalized eukaryotic gene prediction programs, RAPYD enables the integration of individual gene prediction strategies. The analysis of the M. guilliermondii genome data highlights the usefulness of this annotation platform, especially for the analysis of unfinished genomes. As more and more genomes remain in draft status nowadays, this platform might be useful to a wide range of other yeast sequencing projects in the future.

Acknowledgements

J.S. acknowledges the receipt of a scholarship from the CLIB Graduate Cluster Industrial Biotechnology. H.N. is grateful for financial support by SysLogics (grant 0315275A). J.B., B.L. and S.J. acknowledge financial support of the BMBF within the GenoMik-Transfer program (grant 0315599B).

References

Altermann, E., Klaenhammer, T.R., 2003. GAMOLA: a new local solution for sequence annotation and analyzing draft and finished prokaryotic genomes. Omics 7, 161–169.

Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S., 2004. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783–795.

Blom, J., Albaum, S.P., Doppmeier, D., Pühler, A., Vorhölter, F.J., Zakrzewski, M., Goesmann, A., 2009. EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics 10, 154.

Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bairoch, A., 2007. UniProtKB/Swiss-Prot. Methods Mol. Biol. 406, 89–112.

Bryson, K., Loux, V., Bossy, R., Nicolas, P., Chaillou, S., van de Guchte, M., Penaud, S., Maguin, E., Hoebeke, M., Bessieres, P., Gibrat, J.F., 2006. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res. 34, 3533–3545.

Butler, G., Rasmussen, M.D., Lin, M.F., Santos, M.A., Sakthikumar, S., Munro, C.A., Rheinbay, E., Grabherr, M., Forche, A., Reedy, J.L., Agrafioti, I., Arnaud, M.B., Bates, S., Brown, A.J., Brunke, S., Costanzo, M.C., Fitzpatrick, D.A., de Groot, P.W., Harris, D., Hoyer, L.L., Hube, B., Klis, F.M., Kodira, C., Lennard, N., Logue, M.E., Martin, R., Neiman, A.M., Nikolaou, E., Quail, M.A., Quinn, J., Santos, M.C., Schmitzberger, F.F., Sherlock, G., Shah, P., Silverstein, K.A., Skrzypek, M.S., Soll, D., Staggs, R., Stansfield, I., Stumpf, M.P., Sudbery, P.E., Srikantha, T., Zeng, Q., Berman, J., Berriman, M., Heitman, J., Gow, N.A., Lorenz, M.C., Birren, B.W., Kellis, M., Cuomo, C.A., 2009. Evolution of pathogenicity and sexual reproduction in eight Candida genomes. Nature 459, 657–662.

Caspi, R., Altman, T., Dale, J.M., Dreher, K., Fulcher, C.A., Gilham, F., Kaipa, P., Karthikeyan, A.S., Kothari, A., Krummenacker, M., Latendresse, M., Mueller, L.A., Paley, S., Popescu, L., Pujar, A., Shearer, A.G., Zhang, P., Karp, P.D., 2010. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 38, D473–D479.

Chaudhuri, R.R., Loman, N.J., Snyder, L.A., Bailey, C.M., Stekel, D.J., Pallen, M.J., 2008. xBASE2: a comprehensive resource for comparative bacterial genomics. Nucleic Acids Res. 36, D543–D546.

Conesa, C., Calvo, M., Sanchez, L., 2010. Recombinant human lactoferrin: a valuable protein for pharmaceutical products and functional foods. Biotechnol. Adv.

De Schutter, K., Lin, Y.C., Tiels, P., Van Hecke, A., Glinka, S., Weber-Lehmann, J., Rouze, P., Van de Peer, Y., Callewaert, N., 2009. Genome sequence of the recombinant protein production host Pichia pastoris. Nat. Biotechnol. 27, 561–566.

De Vos, M.M., Cuenca-Estrella, M., Boekhout, T., Theelen, B., Matthijs, N., Bauters, T., Nailis, H., Dhont, M.A., Rodriguez-Tudela, J.L., Nelis, H.J., 2005. Vulvovaginal candidiasis in a Flemish patient population. Clin. Microbiol. Infect. 11, 1005–1011.

Dwight, S.S., Harris, M.A., Dolinski, K., Ball, C.A., Binkley, G., Christie, K.R., Fisk, D.G., Issel-Tarver, L., Schroeder, M., Sherlock, G., Sethuraman, A., Weng, S., Botstein, D., Cherry, J.M., 2002. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 30, 69–72.

Fernandez-Suarez, X.M., Schuster, M.K., 2010. Using the ensembl genome server to browse genomic sequence data. In: Current Protocols in Bioinformatics (Chapter 1, Unit 1.15).

Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L., Eddy, S.R., Bateman, A., 2010. The Pfam protein families database. Nucleic Acids Res. 38, D211–D222.

Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M., Dubchak, I., 2004. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273–W279.


Funahashi, A., Jouraku, A., Kitano, H., 2004. Converting KEGG pathway database to SBML. In: Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004).

Grossetete, S., Labedan, B., Lespinet, O., 2010. FUNGIpath: a tool to assess fungal metabolic pathways predicted by orthology. BMC Genomics 11, 81.

Haft, D.H., Selengut, J.D., White, O., 2003. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373.

Handumrongkul, C., Ma, D.P., Silva, J.L., 1998. Cloning and expression of Candida guilliermondii xylose reductase gene (xyl1) in Pichia pastoris. Appl. Microbiol. Biotechnol. 49, 399–404.

Hedeler, C., Wong, H.M., Cornell, M.J., Alam, I., Soanes, D.M., Rattray, M., Hubbard, S.J., Talbot, N.J., Oliver, S.G., Paton, N.W., 2007. e-Fungi: a data resource for comparative analysis of fungal genomes. BMC Genomics 8, 426.

Hucka, M., Finney, A., Sauro, H.M., Bolouri, H., Doyle, J.C., Kitano, H., Arkin, A.P., Bornstein, B.J., Bray, D., Cornish-Bowden, A., Cuellar, A.A., Dronov, S., Gilles, E.D., Ginkel, M., Gor, V., Goryanin, I.I., Hedley, W.J., Hodgman, T.C., Hofmeyr, J.H., Hunter, P.J., Juty, N.S., Kasberger, J.L., Kremling, A., Kummer, U., Le Novere, N., Loew, L.M., Lucio, D., Mendes, P., Minch, E., Mjolsness, E.D., Nakayama, Y., Nelson, M.R., Nielsen, P.F., Sakurada, T., Schaff, J.C., Shapiro, B.E., Shimizu, T.S., Spence, H.D., Stelling, J., Takahashi, K., Tomita, M., Wagner, J., Wang, J., 2003. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531.

Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H., Yeats, C., 2009. InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215.

Jeffries, T.W., Grigoriev, I.V., Grimwood, J., Laplaza, J.M., Aerts, A., Salamov, A., Schmutz, J., Lindquist, E., Dehal, P., Shapiro, H., Jin, Y.S., Passoth, V., Richardson, P.M., 2007. Genome sequence of the lignocellulose-bioconverting and xylose-fermenting yeast Pichia stipitis. Nat. Biotechnol. 25, 319–326.

Jung, K., Park, J., Choi, J., Park, B., Kim, S., Ahn, K., Choi, J., Choi, D., Kang, S., Lee, Y.H., 2008. SNUGB: a versatile genome browser supporting comparative and functional fungal genomics. BMC Genomics 9, 586.

Kaur, P., Singh, B., Boer, E., Straube, N., Piontek, M., Satyanarayana, T., Kunze, G., 2010. Pphy-A cell-bound phytase from the yeast Pichia anomala: molecular cloning of the gene PPHY and characterization of the recombinant enzyme. J. Biotechnol. 149, 8–15.

Kitano, H., Funahashi, A., Matsuoka, Y., Oda, K., 2005. Using process diagrams for the graphical representation of biological networks. Nat. Biotechnol. 23, 961–966.

Klamt, S., Saez-Rodriguez, J., Gilles, E.D., 2007. Structural and functional analysis of cellular networks with CellNetAnalyzer. BMC Syst. Biol. 1, 2.

Krieger, C.J., Zhang, P., Mueller, L.A., Wang, A., Paley, S., Arnaud, M., Pick, J., Rhee, S.Y., Karp, P.D., 2004. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 32, D438–D442.

Kurtzman, C.P., Suzuki, M., 2010. Phylogenetic analysis of ascomycete yeasts that form coenzyme Q-9 and the proposal of the new genera Babjeviella, Meyerozyma, Millerozyma, Priceomyces, and Scheffersomyces. Mycoscience 51, 2–14.

Lagesen, K., Hallin, P., Rodland, E.A., Staerfeldt, H.H., Rognes, T., Ussery, D.W., 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108.

Laslett, D., Canback, B., 2004. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32, 11–16.

Marchler-Bauer, A., Anderson, J.B., Chitsaz, F., Derbyshire, M.K., DeWeese-Scott, C., Fong, J.H., Geer, L.Y., Geer, R.C., Gonzales, N.R., Gwadz, M., He, S., Hurwitz, D.I., Jackson, J.D., Ke, Z., Lanczycki, C.J., Liebert, C.A., Liu, C., Lu, F., Lu, S., Marchler, G.H., Mullokandov, M., Song, J.S., Tasneem, A., Thanki, N., Yamashita, R.A., Zhang, D., Zhang, N., Bryant, S.H., 2009. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37, D205–D210.

Martinez-Guerrero, C.E., Ciria, R., Abreu-Goodger, C., Moreno-Hagelsieb, G., Merino, E., 2008. GeConT 2: gene context analysis for orthologous proteins, conserved domains and metabolic pathways. Nucleic Acids Res. 36, W176–W180.

Mewes, H.W., Dietmann, S., Frishman, D., Gregory, R., Mannhaupt, G., Mayer, K.F., Munsterkotter, M., Ruepp, A., Spannagl, M., Stumpflen, V., Rattei, T., 2008. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Res. 36, D196–D201.

Meyer, F., Goesmann, A., McHardy, A.C., Bartels, D., Bekel, T., Clausen, J., Kalinowski, J., Linke, B., Rupp, O., Giegerich, R., Pühler, A., 2003. GenDB—an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 31, 2187–2195.

Moutselos, K., Kanaris, I., Chatziioannou, A., Maglogiannis, I., Kolisis, F.N., 2009. KEGGconverter: a tool for the in-silico modelling of metabolic networks of the KEGG Pathways database. BMC Bioinformatics 10, 324.

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., Kanehisa, M., 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34.

Park, J., Park, B., Jung, K., Jang, S., Yu, K., Choi, J., Kong, S., Park, J., Kim, S., Kim, H., Kim, S., Kim, J.F., Blair, J.E., Lee, K., Kang, S., Lee, Y.H., 2008. CFGP: a web-based, comparative fungal genomics platform. Nucleic Acids Res. 36, D562–D571.

Parra, G., Blanco, E., Guigo, R., 2000. GeneID in Drosophila. Genome Res. 10, 511–515.

Peterson, J.D., Umayam, L.A., Dickinson, T., Hickey, E.K., White, O., 2001. The comprehensive microbial resource. Nucleic Acids Res. 29, 123–125.

Rice, P., Longden, I., Bleasby, A., 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277.


Rossouw, D., van den Dool, A.H., Jacobson, D., Bauer, F.F., 2010. Comparative transcriptomic and proteomic profiling of industrial wine yeast strains. Appl. Environ. Microbiol. 76, 3911–3923.

Salzberg, S.L., Pertea, M., Delcher, A.L., Gardner, M.J., Tettelin, H., 1999. Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31.

Satyanarayana, T., Kunze, G., 2009. Yeast Biotechnology: Diversity and Applications. Springer.

Schneider, J., Vorhölter, F.J., Trost, E., Blom, J., Musa, Y.R., Neuweger, H., Schatschneider, S., Tauch, A., Goesmann, A., 2010. CARMEN — Comparative Analysis and Reconstruction of MEtabolic Networks: in silico reconstruction of organism-specific metabolic networks using SBML. Genet. Mol. Res. 9, 1660–1672.

Skrzypek, M.S., Arnaud, M.B., Costanzo, M.C., Inglis, D.O., Shah, P., Binkley, G., Miyasato, S.R., Sherlock, G., 2010. New tools at the Candida Genome Database: biochemical pathways and full-text literature search. Nucleic Acids Res. 38, D428–D432.

Stanke, M., Schoffmann, O., Morgenstern, B., Waack, S., 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62.

Stanke, M., Waack, S., 2003. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (Suppl. 2), ii215–ii225.

Suwannarangsee, S., Oh, D.B., Seo, J.W., Kim, C.H., Rhee, S.K., Kang, H.A., Chulalaksananukul, W., Kwon, O., 2010. Characterization of alcohol dehydrogenase 1 of the thermotolerant methylotrophic yeast Hansenula polymorpha. Appl. Microbiol. Biotechnol. 88, 497–507.


Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov, S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A., 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.

Ter-Hovhannisyan, V., Lomsadze, A., Chernoff, Y.O., Borodovsky, M., 2008. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 18, 1979–1990.

Tomas-Pejo, E., Ballesteros, M., Oliva, J.M., Olsson, L., 2010. Adaptation of the xylose fermenting yeast Saccharomyces cerevisiae F12 for improving ethanol production in different fed-batch SSF processes. J. Ind. Microbiol. Biotechnol. 37, 1211–1220.

Uchiyama, I., 2003. MBGD: microbial genome database for comparative analysis. Nucleic Acids Res. 31, 58–62.

Walter, M.C., Rattei, T., Arnold, R., Guldener, U., Munsterkotter, M., Nenova, K., Kastenmuller, G., Tischler, P., Wolling, A., Volz, A., Pongratz, N., Jost, R., Mewes, H.W., Frishman, D., 2009. PEDANT covers all complete RefSeq genomes. Nucleic Acids Res. 37, D408–D411.

Wang, A., Wang, S., Shen, M., Chen, F., Zou, Z., Ran, X., Cheng, T., Su, Y., Wang, J., 2009. High level expression and purification of bioactive human alpha-defensin 5 mature peptide in Pichia pastoris. Appl. Microbiol. Biotechnol. 84, 877–884.

Yeh, L.T., Charles, A.L., Ho, C.T., Huang, T.C., 2009. A novel bread making process using salt-stressed Baker’s yeast. J. Food Sci. 74, S399–S402.

Zou, Y.Z., Qi, K., Chen, X., Miao, X.L., Zhong, J.J., 2010. Favorable effect of very low initial K(L)a value on xylitol production from xylose by a self-isolated strain of Pichia guilliermondii. J. Biosci. Bioeng. 109, 149–152.
