Whole-genome analysis: annotations and updates
Terry Gaasterland* and Mihaela Oprea†


The most important advances in the field of genome annotation over the past two years involve the use of cDNA sequences, protein structures and gene expression data to predict genes. These types of information not only improve gene identification, but they also give insights into variation in gene structure and function.

Addresses
Laboratory of Computational Genomics, 1230 York Avenue, Rockefeller University, New York, New York 10021, USA
*e-mail: [email protected]
†e-mail: [email protected]

Current Opinion in Structural Biology 2001, 11:377–381

0959-440X/01/$ - see front matter
Published by Elsevier Science Ltd.

Abbreviations
BAC bacterial artificial chromosome
EST expressed sequence tag

Introduction
As new genome sequence data sets are deposited in public databases and updated, initial whole-genome analyses must be updated in turn. Updated analysis applies to both the gene-finding phase and the function-assignment phase of genome annotation. The annotation update problem becomes even more compelling now that two draft human genome sequences have been reported [1••,2••]. These initial human genome annotations identified 39,114 and 31,778 proteins, respectively. Three recent analyses of the human sequences in the dbEST, Unigene and The Institute for Genomic Research (TIGR) Gene Index databases of expressed sequence tags (ESTs) indicate that somewhere on the order of 30% of genes have evidence of alternative splice forms [3,4,5•], each producing a different form of a protein or a differently regulated protein. The RIKEN Institute in Tsukuba, Japan, together with the FANTOM Consortium, recently reported 21,076 contiguous full-length sequences for cDNA clones from mouse [6••]. The sequences contain 15,295 distinct sequence clusters, each corresponding to a putative gene. The growing EST and cDNA data sets complement not only the human genomes but also the complete genomes of four nonmammalian eukaryotic organisms (Caenorhabditis elegans, Saccharomyces cerevisiae, Drosophila melanogaster [7••] and Arabidopsis thaliana [8••]), as well as the complete or nearly complete genomes of more than 55 bacterial and archaeal organisms. There are 323 genomes underway in various phases of completion (http://igweb.integratedgenomics.com/GOLD). Furthermore, several funded consortia for high-throughput protein structure determination are generating new structures that will enable the construction of 3D protein structure models for increasing numbers of genes in all of the genomes. Finally, more and more groups are generating high-throughput microarray gene expression data that yield insights into functional dependencies among genes and into gene and protein networks. Over time, careful comparison and integration of these data sets will lead to updated and refined protein-coding regions with computationally assigned functions. This review explores the implications of integrating each forthcoming type of sequence and structure data with the annotation of completely sequenced genomes.

Initial annotation of the human genome sequence data
In February 2001, two versions of a draft human genome were reported [1••,2••]. The two human genome sequencing projects used radically different sequencing methods. One team used whole-genome shotgun sequencing [1••]. The other team sequenced one bacterial artificial chromosome (BAC) at a time [2••]. Both versions of the genome produced multiple contiguous sequences after assembly. The primary difference is the degree to which unconnected assembled sequences are properly ordered and oriented with respect to one another. Some chromosomes from the BAC approach are completely finished; others are partially covered by unordered or unoriented contigs of varying size [2••]. The whole-genome shotgun approach used paired reads from both ends of 50,000 base pair genomic inserts to impose order and orientation across contigs [1••]. The final orientation and ordering of contigs will have a profound impact on both DNA content and gene structure of the annotated genes. Any genes that span two or more contigs will be identified computationally in multiple pieces. If the underlying contigs are ordered, the genes can be pieced together. Otherwise, the annotation process will maintain the gene fragments as multiple gene predictions. To annotate gene structures, both projects used the public cDNA, EST and protein sequence data sets, but applied quite distinct methods to identify genes within the two genomes.

Both approaches started with known human genes and mapped them to the assembled genomic sequence. The Celera approach [1••] identified 6538 previously known human genes. The Celera method used sequence homology search algorithms such as BLAST and sequence threading algorithms such as SIM4 to map human, mouse and rat ESTs and known proteins to the genome. In addition, they mapped private mouse sequence data to the human genome to identify regions that matched at 85% identity or higher. Conserved sequence between mouse and human tends to correspond to exons and promoters, and sometimes introns [9]. Contiguous genomic sequences that matched proteins, mRNAs, cDNA ESTs or unassembled genomic mouse sequences were extracted and masked. Each human contig base that matched no other sequence was replaced with an 'N'. Gene prediction tools were then applied to refine the resulting gene models. Each resulting exon was assessed for depth of evidence, that is, how many sequences of what type aligned with each exon. This approach, called Otto, predicted 11,226 genes in addition to the known genes. Three gene prediction tools, Genscan [10], Grail [11] and FGENESH, produced an additional 21,350 genes with sequence similarity evidence, of which 8619 genes had at least two types of evidence. Using the smaller set of 8619 genes, the Celera human genome annotation contains 26,383 genes. With the larger number of 21,350, it contains 39,114 genes.
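The Celera totals can be checked with simple arithmetic. The sketch below uses only the counts quoted in the text and assumes that the 11,226 Otto genes are counted in addition to the 6538 previously known genes, which is the reading under which the quoted totals add up; the variable names are illustrative.

```python
# Counts quoted in the text for the Celera annotation.
known_genes = 6538          # previously known human genes mapped to the genome
otto_additional = 11226     # further genes predicted by Otto (assumed additive)
ab_initio_any = 21350       # tool predictions with sequence similarity evidence
ab_initio_two_types = 8619  # subset with at least two types of evidence

otto_total = known_genes + otto_additional
conservative_total = otto_total + ab_initio_two_types
inclusive_total = otto_total + ab_initio_any

print(otto_total, conservative_total, inclusive_total)
```

Under this assumption the conservative and inclusive sums reproduce the 26,383 and 39,114 gene totals given above.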

Ensembl also started by mapping known human genes to the draft genome. They then mapped known genes from related organisms and known complete proteins to the draft genome. This resulted in 14,882 genes with direct evidence for a full-length gene. Next, the group processed predicted genes from two software tools, Genscan [10] and Genie [12]. Overlapping predicted genes from the two tools were merged and individual exons were validated by comparison with EST sequences. The merged set produced 4057 genes and Genscan alone produced an additional 12,839 genes with supporting EST evidence, for a total of 31,778 annotated genes in the Ensembl human genome annotation.
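Merging overlapping predictions from two tools can be pictured as an interval merge. The following is a minimal sketch only: the text does not describe Ensembl's actual merging rules, so the function name, the coordinates and the decision to ignore strand and exon structure are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in genomic coordinates, inclusive


def merge_overlapping(predictions: List[Interval]) -> List[Interval]:
    """Merge predicted gene spans that overlap into single spans."""
    merged: List[Interval] = []
    for start, end in sorted(predictions):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it to cover both.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical coordinates standing in for predictions from two tools.
tool_a = [(100, 500), (900, 1500)]
tool_b = [(400, 700), (2000, 2400)]
merged_spans = merge_overlapping(tool_a + tool_b)
print(merged_spans)

# The three Ensembl categories quoted in the text sum to the reported total.
ensembl_total = 14882 + 4057 + 12839
```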

At the time of writing, 26,544 Celera-annotated proteins and 29,304 Ensembl-annotated proteins were freely available to academic researchers (via http://www.celera.com and http://www.ensembl.org). An additional 12,712 predicted proteins, giving a total of 39,356, are available from Celera. Subsequent comparison of the DNA transcripts for the 26,544 Celera proteins and 29,304 Ensembl proteins indicates that approximately 7000 (26%) of the Celera-annotated genes have no BLASTN sequence similarity matches at all (using default parameters and without masking for low information content) with the Ensembl set. Approximately 3600 (12%) of the Ensembl genes have no matches with the Celera set, leaving 74% of the Celera genes and 88% of the Ensembl genes with complete or partial matches across the two sets.
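The cross-set comparison reduces to counting which query sequences have no hit in the other set. The sketch below assumes hits are already available as (query, subject) pairs, for example parsed from tabular search output; the function name and the toy identifiers are hypothetical.

```python
from typing import Iterable, List, Tuple


def fraction_unmatched(query_ids: List[str],
                       hits: Iterable[Tuple[str, str]]) -> float:
    """Fraction of queries with no hit at all in the other annotation set."""
    matched = {query for query, _subject in hits}
    unmatched = [q for q in query_ids if q not in matched]
    return len(unmatched) / len(query_ids)


# Toy example with made-up identifiers.
queries = ["c1", "c2", "c3", "c4"]
toy_hits = [("c1", "e7"), ("c1", "e9"), ("c3", "e2")]
print(fraction_unmatched(queries, toy_hits))

# Percentages quoted in the text, recomputed from the approximate counts.
celera_unmatched_pct = round(100 * 7000 / 26544)
ensembl_unmatched_pct = round(100 * 3600 / 29304)
```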

Initial annotation of 21,000 mouse full-length cDNAs
In February 2001, the RIKEN Institute in Japan reported annotated sequences for 21,076 full-length cDNA clones [6••]. The sequences were enriched for novel genes by first sequencing 3′ ends of clones and then fully sequencing only clones with novel 3′ ends. The 21,000 clones reduced to 15,295 clusters, each representing a unique gene. As with EST clustering, the number of clusters can vary according to sequence comparison parameters and stringency of requirements for inclusion of two sequences in the same cluster. Only 2390 (16%) of the RIKEN clusters matched known mouse genes. An additional 4114 (27%) clusters matched known proteins from other organisms, 573 (3%) contained protein function motifs only and 1634 (11%) matched other proteins with no known function. The remainder was unique to mouse (before the draft human genome was released).
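The dependence of cluster counts on stringency can be illustrated with single-linkage clustering over pairwise identities. This is a generic sketch, not the RIKEN procedure: the identity values, thresholds and function name are invented for illustration.

```python
from typing import Dict, List, Tuple


def cluster_count(n_items: int,
                  pair_identity: Dict[Tuple[int, int], float],
                  threshold: float) -> int:
    """Count single-linkage clusters when pairs at or above threshold are joined."""
    parent: List[int] = list(range(n_items))

    def find(i: int) -> int:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), identity in pair_identity.items():
        if identity >= threshold:
            parent[find(a)] = find(b)
    return len({find(i) for i in range(n_items)})


# Hypothetical pairwise identities between five sequences.
toy_identity = {(0, 1): 0.99, (1, 2): 0.93, (3, 4): 0.96}
strict = cluster_count(5, toy_identity, threshold=0.95)
relaxed = cluster_count(5, toy_identity, threshold=0.90)
print(strict, relaxed)
```

With the stricter threshold the five toy sequences fall into three clusters; relaxing it joins one more pair and leaves two.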

The public genome project assessed the completeness of their annotation by measuring how frequently RIKEN clones matched the genome sequence but did not match predicted genes [2••]. Of the RIKEN clone clusters, 81% matched the genome. Only 69% of the clones also matched predicted genes. The resulting sensitivity of 85% (69/81) indicates that a more complete public sequence annotation could increase by 15% to cover 36,500 genes.
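The sensitivity estimate is a ratio of the two match rates. In the sketch below, applying a 15% increase to the 31,778 Ensembl genes is an assumption about how the 36,500 figure was derived; the text does not spell out the calculation.

```python
# Match rates quoted in the text.
frac_matching_genome = 0.81     # RIKEN clusters that match the genome
frac_matching_predicted = 0.69  # clones that also match predicted genes

sensitivity = frac_matching_predicted / frac_matching_genome

# Assumed derivation of the projected gene count: a 15% increase
# over the 31,778 genes in the Ensembl annotation.
ensembl_genes = 31778
projected_genes = ensembl_genes * 1.15

print(round(sensitivity, 2), round(projected_genes))
```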

Functional annotation through protein families
When gene structures have been identified, the next annotation step for any genome sequence project is to assign biochemical and physiological function to each gene. An initial functional screen can be done through a protein-level pairwise sequence comparison of annotated genes with proteins of known function. This approach, however, tends to assign function even when critical amino acid residues are missing from the annotated protein. Although, in general, conservation of sequence implies conservation of structure and conserved structure implies conserved function, key detailed changes in amino acids can alter function significantly. To address this, databases have been built to capture conserved protein sequence motifs associated with function. In particular, the PFAM (Protein Family Database) [13] and PRINTS (Protein Fingerprints Database) [14] databases of protein functional motifs classify protein sequence domains into functional categories and then use multiple sequence examples to extract sequence patterns that uniquely identify the function. When a putative protein matches a pattern, it is likely to have the function associated with the pattern. Whereas PFAM tries to capture patterns that identify general functional domains, PRINTS seeks to identify patterns that subclassify different types of proteins within a functional class. Together, the two resources are an invaluable source of information for protein annotation. The InterPro project [15•] is designed to bring all protein motif patterns into one database and search system for annotating new protein sequences. The Gene Ontology consortium [16•] is seeking to create a standardized, comprehensive, multi-organism vocabulary to describe gene functions. The MAGPIE automated genome and clone annotation system [17] combines sequence similarity evidence, genomic proximity to genes of known function, correlated gene expression levels and protein functional motif patterns to assign putative function to genomic proteins.
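Pattern-based function assignment can be illustrated with a single sequence motif. The sketch below is a deliberate simplification: PFAM and PRINTS rely on profile and fingerprint methods rather than one regular expression. The pattern used is the widely known P-loop (Walker A) ATP/GTP-binding motif, [AG]-x(4)-G-K-[ST], which is not taken from this review, and the protein sequences are invented.

```python
import re
from typing import Optional, Tuple

# P-loop motif [AG]-x(4)-G-K-[ST] written as a regular expression.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein: str) -> Optional[Tuple[int, str]]:
    """Return (0-based start, matched residues) of the first motif hit, or None."""
    match = P_LOOP.search(protein)
    if match is None:
        return None
    return match.start(), match.group()


# Invented sequences: one with the motif, one without.
with_motif = "MKVAGAPGSGKSTLAR"
without_motif = "MKVLLDEPRRWQ"
print(find_motif(with_motif), find_motif(without_motif))
```

A hit of this kind suggests, but does not prove, nucleotide binding; as the text notes, missing critical residues elsewhere in the protein can still alter function.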

Schultz et al. [18] used known signal peptide domains stored in a database of functional protein structure modules called SMART to search exhaustively in the EST databases for sequences that contained signal peptide leader sequences and matched SMART domain motifs. They used known domains from 100 signal domain families to find 4206 ESTs that corresponded to over 1000 potential human proteins. This provides a rich data set to annotate signal proteins quickly and effectively in the draft human genomes.

378 Sequences and topology

Functional annotation through prokaryotic operons
For prokaryotic organisms, clusters of genes that are close enough to be either co-transcribed in operons, or remnants of operons, provide information to infer functionally coupled genes. Overbeek et al. [19] examined, for 30 prokaryotic genomes, whether genes in close genomic proximity participate in the same metabolic pathway. They found a significant correlation between genomic proximity and participation in the same metabolic pathways, and used genomic proximity to infer possible functional relatedness for genes of unknown function that were putatively co-transcribed with other genes of known function.
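Genomic proximity can be turned into candidate co-transcribed groups by collecting neighbouring genes on the same strand. This is a minimal sketch of that idea and not the method of Overbeek et al.: the 300 bp gap threshold, the gene names and the coordinates are assumptions chosen for illustration.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, str]  # (name, start, end, strand)


def candidate_runs(genes: List[Gene], max_gap: int = 300) -> List[List[str]]:
    """Group adjacent same-strand genes separated by at most max_gap bases."""
    runs: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if current:
            previous = current[-1]
            same_strand = previous[3] == gene[3]
            gap = gene[1] - previous[2]
            if same_strand and gap <= max_gap:
                current.append(gene)
                continue
            runs.append(current)
        current = [gene]
    if current:
        runs.append(current)
    # Only groups of two or more genes are interesting as operon candidates.
    return [[g[0] for g in run] for run in runs if len(run) > 1]


# Hypothetical genes on one replicon.
toy_genes = [
    ("geneA", 100, 900, "+"),
    ("geneB", 950, 1800, "+"),
    ("geneC", 1850, 2500, "+"),
    ("geneD", 4000, 4800, "+"),
    ("geneE", 4900, 5500, "-"),
]
print(candidate_runs(toy_genes))
```

A gene of unknown function that falls in the same run as genes of known function would then inherit a tentative functional association, to be checked against other evidence.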

Updating functional annotations: a case study in Escherichia coli
An updated functional annotation of the Escherichia coli genome assigned new functional information to 497 genes [20]. Of these, 432 genes previously had no known function, but had protein sequence homology to proteins from other organisms. The functional update was based on protein sequence similarity to genes from genomes completed since 1997 and on literature reviews encoded in GenProtEC (http://genprotec.mbl.edu). The re-annotation process included an updated gene prediction that yielded 508 additional plausible genes, of which nine were genes that received new information about their function.

Extending annotation through new expressed sequence tags
Gene identification is an ongoing process. Growing data sets of cDNA ESTs provide a rich resource to identify additional genes [21,22•]. The Cancer Genome Anatomy Project (CGAP) continues to add ESTs from healthy, precancerous and cancerous human, mouse and rat tissues to the public databases. The growing body of data was a primary resource for establishing the validity of predicted genes in the recently released draft assemblies of the human genome [1••,2••].

Computational and microarray analysis of 3141 Drosophila testis ESTs [22•] showed that 16% had no sequence similarity to known or predicted genes. New biological evidence was gained from 31% of the sequences for Drosophila genes predicted through sequence similarity to other organisms. In other work, 458 predicted but previously unannotated genes in the D. melanogaster genome matched CGAP ESTs at the protein level with significant sequence similarity (i.e. better than 10^-4) [23•].

Extending annotation through alternatively spliced genes
It is notable that two independent studies of ESTs for evidence of variant gene structures assessed that 10–11% of genes have variant spliced transcripts with internal skipped or truncated exons. Three studies [3,4,5•] indicate that, on average, one-third of genes have EST evidence of alternative splicing of any sort.

Mironov et al. [3] assessed the EST evidence for splice variants in a gold standard set of 392 genes established for measuring the quality of gene prediction tools. They identified the genomic sequence for each gene. They then mapped ESTs to the genomic sequences. When ESTs spanned introns, they predicted splice sites. Finally, they noted ESTs that skipped an exon or had alternative donor or acceptor sites compared with other ESTs. In total, their analysis showed that 33% of the genes were alternatively spliced, of which 30% involved internal skipped or truncated exons. Thus, 10% of the 392 genes had internal variant splicing. Of the internal variants, 23% had alternative acceptor sites (2.3% of the 392 genes), 16% had alternative donor sites (1.6% of 392) and 27% contained missing middle exons (2.7% of 392). The remaining internal variants were complex mixtures of these three variations.
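The three variant types can be distinguished by comparing the introns implied by an EST with those of a reference transcript. The sketch below is one simple way to do this and is not the procedure of Mironov et al.; it assumes forward-strand coordinates, exact splice-site agreement and invented exon positions.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) on the forward strand


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Introns as (donor, acceptor) pairs between consecutive exons."""
    ordered = sorted(exons)
    return [(a[1], b[0]) for a, b in zip(ordered, ordered[1:])]


def classify_variants(reference: List[Exon], est: List[Exon]) -> List[str]:
    """Label each EST intron that is absent from the reference transcript."""
    ref_introns = set(introns(reference))
    donors = {donor for donor, _ in ref_introns}
    acceptors = {acceptor for _, acceptor in ref_introns}
    labels = []
    for donor, acceptor in introns(est):
        if (donor, acceptor) in ref_introns:
            continue  # same splice junction as the reference
        if donor in donors and acceptor in acceptors:
            labels.append("skipped exon")
        elif donor in donors:
            labels.append("alternative acceptor")
        elif acceptor in acceptors:
            labels.append("alternative donor")
        else:
            labels.append("complex")
    return labels


# Hypothetical four-exon reference transcript and three ESTs.
ref = [(1, 100), (201, 300), (401, 500), (601, 700)]
print(classify_variants(ref, [(1, 100), (401, 500)]))
print(classify_variants(ref, [(201, 300), (411, 500)]))
print(classify_variants(ref, [(201, 290), (401, 500)]))
```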

Brett et al. [4] compared 7867 nonredundant mRNA sequences with ESTs to find ESTs that mapped to the mRNAs with long internal gaps in either the mRNA or EST sequence. They found 3011 (38%) of the mRNAs had evidence of 4560 alternative splice forms. The methodology did not check, however, that the sequence in the gap could be translated to protein sequence, so some of the alternative splice forms may have been a result of unspliced introns in the ESTs.
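Detecting long internal gaps amounts to scanning consecutive aligned blocks of an mRNA-EST alignment. The following sketch assumes the alignment is already available as ordered blocks of coordinates; the block format, the 10-base threshold and the labels are hypothetical and are not taken from Brett et al.

```python
from typing import List, Tuple

# (mrna_start, mrna_end, est_start, est_end), all inclusive
Block = Tuple[int, int, int, int]


def internal_gaps(blocks: List[Block], min_gap: int = 10) -> List[Tuple[str, int]]:
    """Report gaps between aligned blocks that are long in only one sequence."""
    gaps = []
    ordered = sorted(blocks)
    for previous, following in zip(ordered, ordered[1:]):
        mrna_gap = following[0] - previous[1] - 1
        est_gap = following[2] - previous[3] - 1
        if mrna_gap >= min_gap and est_gap < min_gap:
            # mRNA bases with no counterpart in the EST.
            gaps.append(("missing_from_est", mrna_gap))
        elif est_gap >= min_gap and mrna_gap < min_gap:
            # EST bases with no counterpart in the mRNA.
            gaps.append(("missing_from_mrna", est_gap))
    return gaps


# Toy alignment: the EST lacks 150 bases that are present in the mRNA.
toy_blocks = [(1, 200, 1, 200), (351, 500, 201, 350)]
print(internal_gaps(toy_blocks))
```

As the text cautions, a gap of the second kind may simply reflect an unspliced intron in the EST, so such calls need an additional check such as translatability of the inserted sequence.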

To determine which UNIGENE clusters mapped to the same gene or to different genes, Zhuo et al. [5•] computed consensus sequences for assembled UNIGENE clusters and mapped each consensus sequence to the draft human genome. Clustering, assembly, gene extension and genome mapping processes produced single consensus sequences for 61,066 UNIGENE clusters plus 16,305 genomic regions mapped to singleton sequences. When mapped to the draft genome, each cluster spanned on average 24,000–34,000 bases. The key difference between the approach to assembly used by Zhuo et al. [5•] and a related approach used by Ewing and Green [24], which produced 40,584 consensus sequences for EST clusters, was the stringency of assembly and the elimination of chimeric ESTs from the input database followed by mapping to the genomic data to confirm whether clusters should have multiple consensus sequences.

Zhuo et al. [5•] identified 6713 consensus sequences whose EST clusters contained alternatively spliced sequences. They applied a stringent screen and sought only variant ESTs that had missing or truncated internal exons or substitute internal exons with respect to the consensus. Only 287 of the 6713 consensus sequences had multiple variants. In total, only 11% of the 61,066 clusters contained evidence of internal alternative splice variants.

Extending annotation through protein structure analysis
The complete, partial and draft genomes provide a rich input data set for high-throughput protein structure determination projects [25]. As these projects generate new protein structures, protein structure models help to confirm putative annotated proteins and refine predicted proteins.

A recent re-annotation of the D. melanogaster genome used remote protein alignments from PSI-BLAST followed by protein structure modeling and model assessment [26] to lend additional evidence to 43% of previously annotated gene structures [23•]. Structure-based validation yielded computational evidence to support the validity of 315 additional predicted plausible gene structures. The structure-based evidence plus comparisons of predicted genes with ESTs and all available genomic proteins from other organisms yielded a total of 1042 candidate gene structures [23•] beyond the 13,601 original annotated genes [7••].

As EST and full-length cDNA sequence analysis yields alternative splice forms for predicted genes, high-quality protein structure models [27] will provide new data to assess variants at the 3D structure level.

Extending functional annotation with gene expression data
Microarray gene expression studies can be designed to reveal genes and gene expression levels specific to tissue types, cell types, developmental phases and cell-cycle phases [28,29,30•]. With a completely finished genome sequence, an array can be populated with oligomeric sequences or amplicons from each annotated gene [28,29]. With sequenced cDNA libraries, an array can be populated with representative cDNA sequences [30•]. Either way, an array is hybridized with fluorescent- or biotin-labeled mRNA extracted from a cell or tissue sample and scanned with a laser. The resulting intensity values indicate levels of gene expression relative to a reference mRNA sample. These measurements provide essential gene annotation data. First, they answer whether the gene is biologically real. Then, for oligo-based arrays designed to measure the expression levels of individual exons, they answer whether the predicted gene structure is correct. Finally, they examine under what conditions a gene is expressed and in association with what other genes. Studies that answer these questions add a new dimension of information to annotated genomes.
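Expression relative to a reference sample is commonly summarized as a log ratio of the two intensities. The sketch below shows that calculation under stated assumptions: the intensity floor that guards against zero or near-zero readings is a hypothetical choice, and the intensities are invented.

```python
import math


def log2_ratio(sample: float, reference: float, floor: float = 1.0) -> float:
    """Log2 of sample over reference intensity, with a floor to avoid log(0)."""
    return math.log2(max(sample, floor) / max(reference, floor))


# Invented intensities: fourfold up, fourfold down, unchanged.
up = log2_ratio(800.0, 200.0)
down = log2_ratio(100.0, 400.0)
unchanged = log2_ratio(250.0, 250.0)
print(up, down, unchanged)
```

On this scale positive values indicate higher expression than the reference, negative values lower expression, and equal fold changes in either direction are symmetric around zero.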

In 1997, DeRisi et al. [28] established a methodology to seek genes that are associated with yeast cell-cycle phases. They discovered over 200 genes that showed cyclic or phase-related expression levels. In 2000, Alter et al. [31•] developed a refined analysis method that uses a standard statistical technique called singular value decomposition, or principal component analysis, to separate genes that share time-dependent regulatory patterns from genes with noisy variation in expression level.
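The idea behind the decomposition can be shown on a toy expression matrix. The sketch below is not the analysis of Alter et al.: it recovers only the leading right singular vector (the dominant time pattern) by power iteration in plain Python, and the matrix values, function names and iteration count are assumptions made for illustration.

```python
import math
from typing import List


def leading_pattern(matrix: List[List[float]], iterations: int = 100) -> List[float]:
    """Leading right singular vector of a genes x time-points matrix."""
    n_cols = len(matrix[0])
    # C = X^T X, a time-points x time-points matrix.
    c = [[sum(row[i] * row[j] for row in matrix) for j in range(n_cols)]
         for i in range(n_cols)]
    # Start vector; it must not be orthogonal to the leading pattern.
    v = [1.0] + [0.0] * (n_cols - 1)
    for _ in range(iterations):
        w = [sum(c[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def fraction_explained(profile: List[float], pattern: List[float]) -> float:
    """Share of a gene's squared signal that lies along the pattern."""
    dot = sum(a * b for a, b in zip(profile, pattern))
    total = sum(a * a for a in profile)
    return dot * dot / total if total else 0.0


# Toy data: two genes share an oscillating pattern, one is small noise.
expression = [
    [1.0, -1.0, 1.0, -1.0],
    [2.0, -2.0, 2.0, -2.0],
    [0.1, 0.1, -0.1, 0.05],
]
pattern = leading_pattern(expression)
shares = [fraction_explained(row, pattern) for row in expression]
print([round(s, 2) for s in shares])
```

Genes whose signal lies almost entirely along the dominant pattern are candidates for shared time-dependent regulation, while genes with a small share behave more like noise in this toy setting.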

More recently, in 2001, the RIKEN Institute in Japan published a study of mouse gene expression levels using arrays spotted with 18,816 cDNAs from the full-length clone collection [30•]. They hybridized a series of samples from 49 mouse developmental stages and tissues, including embryonic brain, postnatal cerebellum and adult olfactory bulb. Among their findings they observed that cell division genes were expressed at higher levels during development than adulthood, protein biosynthesis genes increased just before birth and genes involved in metabolism were generally higher in adult tissues. When they applied the expression level data to genes encoding proteins in the glycolytic pathway, they found tissue-specific expression patterns for the pathway with different expression profiles for muscle, testis, liver and kidney. Overall, neighboring genes in the pathway tended to correlate more closely than genes distant in the pathway. This observation raises the possibility of using gene expression levels to dissect not only signaling pathway relationships between proteins but also metabolic pathway relationships.
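Whether neighboring pathway genes correlate more closely than distant ones can be tested with a correlation coefficient over their expression profiles. The sketch below uses the Pearson coefficient as one reasonable choice; the text does not state which measure the RIKEN group used, and the profiles are invented.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented expression profiles across four tissues.
step_one = [1.0, 2.0, 3.0, 4.0]
next_step = [2.0, 4.0, 6.0, 8.0]     # adjacent enzyme, tracks step_one
distant_step = [3.0, 1.0, 4.0, 2.0]  # enzyme far along the pathway

near = pearson(step_one, next_step)
far = pearson(step_one, distant_step)
print(round(near, 3), round(far, 3))
```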

Microarrays designed from a whole genome can be used to investigate genome sequence differences in closely related organisms. In 1999, Behr et al. [29] designed an expanded series of array studies to investigate genome deletion differences between Mycobacterium tuberculosis, Mycobacterium bovis and Mycobacterium strains used to make vaccines for tuberculosis. They found 38 genes present in M. bovis but absent in the vaccine strains and 91 genes present in a virulent tubercular strain but absent from the vaccine strains. These observations provide the basis for improved vaccines and diagnostics. They also provide information to add putative functional dependencies to genes in the M. tuberculosis genome annotation. In related work, Wilson et al. [32] show which M. tuberculosis genes are elevated in gene expression in response to treatment with isoniazid; the studies revealed functionally related genes that responded to the toxic effects of the drug. As with the pathway observations of the RIKEN group, controlled studies of response to different compounds have the potential to delineate functional pathways encoded in related genes.
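Once presence or absence of each gene has been called from the hybridization data, the strain comparison is a set difference. The sketch below assumes such calls are already available as sets of gene identifiers; the identifiers and the function name are hypothetical.

```python
from typing import List, Set


def absent_from_all(reference_genes: Set[str],
                    other_strains: List[Set[str]]) -> Set[str]:
    """Genes detected in the reference strain but in none of the other strains."""
    present_elsewhere: Set[str] = set()
    for strain_genes in other_strains:
        present_elsewhere |= strain_genes
    return reference_genes - present_elsewhere


# Invented presence calls for a virulent strain and two vaccine strains.
virulent = {"g1", "g2", "g3", "g4"}
vaccine_strains = [{"g1", "g2"}, {"g1", "g3"}]
print(absent_from_all(virulent, vaccine_strains))
```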

Conclusions
Whole-genome annotation is an ongoing process. As more genomes are sequenced, protein families will gain new member sequences, split into smaller families and gain refined patterns of conserved functional motifs. Biochemical function information assigned to one family member becomes a source of putative functional information about the other members. As EST and cDNA sequence collections grow, even in the absence of functional information, protein families will at least gain information about the diversity of the sequences, the relative evolution rates of ortholog families and paralog families, and the degree of expression in different tissues and in different organisms. Microarray gene expression information connected to genome sequence annotations provides additional information about expression conditions. Furthermore, well-designed microarrays will validate genes, indicate what combinations of exons are used in different tissues and confirm alternative combinations of exons.

References and recommended readingPapers of particular interest, published within the annual period of review,have been highlighted as:

• of special interest•• of outstanding interest

1. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, •• Smith HO, Yandell M, Evans CA, Holt RA et al..: The sequence of the

human genome. Science 2001, 291:1304-1351.This paper reports on the human genome sequence generated throughwhole-genome shotgun sequencing. Contiguous assembled sequenceswere annotated using public sequence data, three gene prediction tools anddraft mouse genome sequence data.

2. International Human Genome Sequencing Consortium: A physical •• map of the human genome. Nature 2001, 409:934-941.This paper reports on the working draft human genome sequence completed to date by the public sequencing consortium. The draft isassembled from complete and partially sequenced BAC clones. Theassembled genome is annotated using the Ensembl computational archi-tecture, which includes the use of public sequence data and two gene prediction tools.

3. Mironov AA, Fickett JW, Gelfand MS: Frequent alternative splicingof human genes. Genome Res 1999, 9:1288-1293.

4. Brett D, Hanke J, Lehmann G, Haase S, Delbruck S, Krueger S, Reich J, Bork P: EST comparison indicates 38% of human mRNAscontain possible alternative splice forms. FEBS Lett 2000, 474:83-86.

5. Zhuo D, Zhao W, Wright F, Yang H, Wang J, Sears R, Baer T, • Do-Hun K, Gordon D, Gibbs S et al.: Assembly, annotation, and

integration of UNIGENE clusters into the human genome draft.Genome Res 2001, 11:904-918.

This paper takes a novel computational approach to the integration of humanESTs into the draft human genome.

6. Kawai J, Hasegawa Y: Functional annotation of a full-length mouse •• cDNA collection. Nature 2001, 409:685-690.This paper reports on the most comprehensive collection to date ofsequences for mouse genes.

7. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, •• Kravitz SA, Mobarry CM, Reinert KH, Remington KA et al.:

A whole-genome assembly of Drosophila. Science 2000,287:2196-2204.

This paper describes the assembly and analysis of the first eukaryoticgenome completed through whole-genome shotgun sequencing.

8. International Sequence and Analysis Consortium: Analysis of the •• genome sequence of the flowering plant Arabidopsis thaliana.

Nature 2000, 408:796-815.This paper unifies the individual Arabidopsis chromosome publications intoa full annotation analysis of the genome.

9. Bafna V, Huson DH: The conserved exon method for gene finding.Proc Int Conf Intell Syst Mol Biol 2000, 8:3-12.

10. Burge C, Karlin S: Prediction of complete gene structures inhuman genomic DNA. J Mol Biol 1997, 268:78-94.

11. Uberbacher EC, Xu Y, Mural RJ: Discovering and understandinggenes in human DNA sequence using GRAIL. Methods Enzymol1996, 266:259-281.

12. Reese MG, Kulp D, Tammana H, Haussler D: Genie — gene findingin Drosophila melanogaster. Genome Res 2000, 10:529-538.

13. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL:The Pfam protein families database. Nucleic Acids Res 2000,28:263-266.

14. Attwood TK, Flower DR, Lewis AP, Mabey JE, Morgan SR, Scordis P,Selley JN, Wright W: PRINTS prepares for the new millennium.Nucleic Acids Res 1999, 27:220-225.

15. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, • Bucher P, Cerutti L, Corpet F, Croning MD et al.: InterPro — an

integrated documentation resource for protein families, domainsand functional sites. Bioinformatics 2000, 16:1145-1150.

This paper describes the integration of multiple protein motif databases,including PRINTS, PFAM, and PROSITE.

16. • Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al.: Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 2000, 25:25-29.

This paper describes the computational methods behind the Gene Ontology consortium goal to establish a coherent, standardized set of terms for describing functions of genes in genomes.

17. Gaasterland T, Sczyrba A, Thomas E, Aytekin-Kurban G, Gordon P, Sensen CW: MAGPIE/EGRET annotation of the 2.9-Mb Drosophila melanogaster Adh region. Genome Res 2000, 10:502-510.

18. Schultz J, Doerks T, Ponting CP, Copley RR, Bork P: More than 1,000 putative new human signaling proteins revealed by EST data mining. Nat Genet 2000, 25:201-204.

19. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96:2896-2901.

20. Serres M, Gopal S, Nahum L, Liang P, Gaasterland T, Riley M: A functional update of the E. coli genome. Genome Biol 2001, in press.

21. Schaefer C, Grouse L, Buetow K, Strausberg RL: A new cancer genome anatomy project web resource for the community. Cancer J 2001, 7:52-60.

22. • Andrews J, Bouffard GG, Cheadle C, Lu J, Becker KG, Oliver B: Gene discovery using computational and microarray analysis of transcription in the Drosophila melanogaster testis. Genome Res 2000, 10:2030-2043.

This paper shows that EST libraries from previously unsampled tissues give biological evidence for previously unannotated genes in D. melanogaster.

23. • Gopal S, Schroeder M, Pieper U, Sczyrba A, Aytekin-Kurban G, Bekiranov S, Fajardo JE, Eswar N, Sanchez R, Sali A, Gaasterland T: Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 2001, 27:337-340.

This paper shows that the number of predicted genes for D. melanogaster with protein-level evidence can be extended by protein-level comparison with ESTs from all organisms and 3D protein structure homology modeling.

24. Ewing B, Green P: Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet 2000, 25:232-234.

25. Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res 2000, 28:49-55.

26. Sanchez R, Pieper U, Mirkovic N, de Bakker PI, Wittenstein E, Sali A: MODBASE, a database of annotated comparative protein structure models. Nucleic Acids Res 2000, 28:250-253.

27. Sanchez R, Sali A: Comparative protein structure modeling. Introduction and practical examples with modeller. Methods Mol Biol 2000, 143:97-129.

28. DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278:680-686.

29. Behr MA, Wilson MA, Gill WP, Salamon H, Schoolnik GK, Rane S, Small PM: Comparative genomics of BCG vaccines by whole-genome DNA microarray. Science 1999, 284:1520-1523.

30. • Miki R, Kadota K, Bono H, Mizuno Y, Tomaru Y, Carninci P, Itoh M, Shibata K, Kawai J, Konno H et al.: Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl Acad Sci USA 2001, 98:2199-2204.

This paper uses microarrays of sequenced clones from full-length cDNA libraries to characterize the gene expression levels of mouse genes in 49 different tissues. Expression levels are mapped to metabolic pathways to assess variation of pathway expression in different tissues.

31. • Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000, 97:10101-10106.

This paper describes the use of singular value decomposition to reduce gene expression data for large numbers of genes and arrays to ‘eigengenes’ and ‘eigenarrays’ and thereby filter noise and experimental artifacts from the array experiment interpretation.

32. Wilson M, DeRisi J, Kristensen HH, Imboden P, Rane S, Brown PO, Schoolnik G: Exploring drug-induced alterations in gene expression in Mycobacterium tuberculosis by microarray hybridization. Proc Natl Acad Sci USA 1999, 96:12833-12838.
