J. Craig Venter et al., The Sequence of the Human Genome. Science 291, 1304 (2001); DOI: 10.1126/science.1058040

The following resources related to this article are available online at www.sciencemag.org (this information is current as of August 31, 2009):

A correction has been published for this article at: http://www.sciencemag.org/cgi/content/full/sci;295/5559/1466b

Updated information and services, including high-resolution figures, can be found in the online version of this article at: http://www.sciencemag.org/cgi/content/full/291/5507/1304

Supporting Online Material can be found at: http://www.sciencemag.org/cgi/content/full/291/5507/1304/DC1

A list of selected additional articles on the Science Web sites related to this article can be found at: http://www.sciencemag.org/cgi/content/full/291/5507/1304#related-content

This article cites 152 articles, 52 of which can be accessed for free: http://www.sciencemag.org/cgi/content/full/291/5507/1304#otherarticles

This article has been cited by 4774 article(s) on the ISI Web of Science.

This article has been cited by 79 articles hosted by HighWire Press; see: http://www.sciencemag.org/cgi/content/full/291/5507/1304#otherarticles

This article appears in the following subject collections: Genetics http://www.sciencemag.org/cgi/collection/genetics

Information about obtaining reprints of this article or about obtaining permission to reproduce this article in whole or in part can be found at: http://www.sciencemag.org/about/permissions.dtl

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2001 by the American Association for the Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.




The Sequence of the Human Genome

J. Craig Venter,1* Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1

Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1

Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1

Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1

Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1

George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5

Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9

Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1

Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1

Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1

Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1

Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1

Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1

Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1

Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1

Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1

Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12

Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1

Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1

Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1

Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1

Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1

Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1

Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1

Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1

Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1

Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1

Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1

Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1

Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1

Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1

Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1

Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1

Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1

Josep F. Abril,14 Roderic Guigo,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1

Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1

Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1

Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1

Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1

Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1

Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1

Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1

Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1

Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1

Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1

Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1

Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1

Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1304


A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
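As a rough arithmetic check on the figures quoted in the abstract, the following Python sketch adds the two coverage figures, converts the quoted polymorphism rate into an expected number of differences between two haploid genomes, and converts the exon, intron, and intergenic percentages into megabases. The function and variable names are illustrative and do not come from the paper. Adding the coverages is a simplification, because the 2.9-fold shredded coverage applies only to the regions the public effort had sequenced.

```python
# Figures quoted in the abstract; all names here are illustrative.
GENOME_BP = 2.91e9          # consensus sequence length (bp)
CELERA_FOLD = 5.11          # coverage from Celera reads
SHREDDED_FOLD = 2.9         # coverage from shredded public data
BP_PER_DIFFERENCE = 1250    # one differing base per 1250 bp, on average
FRACTIONS = {"exons": 0.011, "introns": 0.24, "intergenic": 0.75}


def effective_coverage(celera_fold: float, shredded_fold: float) -> float:
    """Naive sum of the two coverage figures (an upper-bound style estimate)."""
    return celera_fold + shredded_fold


def pairwise_differences(genome_bp: float, bp_per_difference: float) -> float:
    """Expected number of differing bases between two haploid genomes."""
    return genome_bp / bp_per_difference


def composition_mbp(genome_bp: float, fractions: dict) -> dict:
    """Convert fractional genome composition into megabases."""
    return {name: genome_bp * frac / 1e6 for name, frac in fractions.items()}


if __name__ == "__main__":
    print("effective coverage:", round(effective_coverage(CELERA_FOLD, SHREDDED_FOLD), 2))
    print("pairwise differences:", round(pairwise_differences(GENOME_BP, BP_PER_DIFFERENCE)))
    for name, mbp in composition_mbp(GENOME_BP, FRACTIONS).items():
        print(f"{name}: {mbp:.1f} Mbp")
```

The naive sum comes to about eightfold, which matches the effective coverage stated in the abstract.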

Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6). From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).

1Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA. 2GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia. 3Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA. 4Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA. 5Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA. 6Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287–4922, USA. 7Rockefeller University, 1230 York Avenue, New York, NY 10021–6399, USA. 8New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA. 9Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA. 10Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520–8000, USA. 11Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA. 12The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 13Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel. 14Grup de Recerca en Informatica Medica, Institut Municipal d’Investigacio Medica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.

*To whom correspondence should be addressed. E-mail: [email protected]

A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics. Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).
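To illustrate why end sequences separated by a known distance give long-range continuity, here is a minimal Python sketch of a mate-pair consistency check. It assumes two reads have already been placed on the same contig and asks whether their orientation and spacing fit the clone they came from. The `PlacedRead` type, the function names, and the 3-SD tolerance are hypothetical choices for illustration, not the assembler described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedRead:
    start: int      # leftmost coordinate of the read on the contig (bp)
    length: int     # read length (bp)
    forward: bool   # True if the read lies on the forward strand


def implied_insert_size(a: PlacedRead, b: PlacedRead) -> int:
    """Span from the outer end of one read to the outer end of its mate."""
    left = min(a.start, b.start)
    right = max(a.start + a.length, b.start + b.length)
    return right - left


def mates_consistent(a: PlacedRead, b: PlacedRead,
                     mean_insert: float, sd_insert: float,
                     n_sd: float = 3.0) -> bool:
    """Check orientation and spacing of a mate pair (tolerance is an assumption)."""
    first, second = (a, b) if a.start <= b.start else (b, a)
    # Mates are read inward from opposite ends of the clone, so the
    # left read should be forward and the right read reverse.
    if not (first.forward and not second.forward):
        return False
    return abs(implied_insert_size(a, b) - mean_insert) <= n_sd * sd_insert


if __name__ == "__main__":
    left = PlacedRead(start=1_000, length=550, forward=True)
    right = PlacedRead(start=10_450, length=550, forward=False)
    print(implied_insert_size(left, right), mates_consistent(left, right, 10_000, 800))
```

A pair that fails this kind of check suggests either a misplaced read or a chimeric clone, which is one reason the validity of the pairing information matters.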

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26–28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.
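To give a feel for why the fold coverage matters, the sketch below uses the standard idealized Poisson model of shotgun sequencing, in which the expected fraction of bases hit by no read at coverage c is exp(-c). This model is an assumption brought in for illustration and is not the paper's analysis. It treats reads as uniformly random and ignores repeats, cloning bias, and the overlap needed to join reads, so real assemblies have more gaps than it predicts.

```python
import math


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases covered by no read (idealized Poisson model)."""
    return math.exp(-coverage)


def expected_uncovered_bp(coverage: float, genome_bp: float) -> float:
    """Expected number of uncovered bases for a genome of the given size."""
    return genome_bp * fraction_uncovered(coverage)


if __name__ == "__main__":
    GENOME_BP = 2.9e9  # genome size used for the coverage figures in Table 1
    for c in (5.11, 8.0, 10.0):
        print(f"{c:>5.2f}-fold: fraction uncovered {fraction_uncovered(c):.2e}, "
              f"about {expected_uncovered_bp(c, GENOME_BP):,.0f} bp")
```

Even under these idealized assumptions, moving from roughly fivefold to eightfold coverage lowers the expected uncovered fraction by more than an order of magnitude.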

Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ~3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.
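The shredding of GenBank-derived data mentioned above can be pictured with a short Python sketch. The abstract gives only the fragment length (550 bp) and the resulting depth (2.9-fold); the even tiling used here, and the function name `shred`, are assumptions for illustration. The exact procedure the authors used is not specified in this part of the paper.

```python
def shred(sequence: str, fragment_len: int = 550, coverage: float = 2.9) -> list[str]:
    """Cut a sequence into overlapping fragments at roughly the requested depth.

    Fragment starts are spaced fragment_len / coverage apart, so each interior
    base falls in about `coverage` fragments. Even tiling is an assumption.
    """
    if len(sequence) <= fragment_len:
        return [sequence]
    step = max(1, round(fragment_len / coverage))
    last_start = len(sequence) - fragment_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the sequence is included
    return [sequence[s:s + fragment_len] for s in starts]


if __name__ == "__main__":
    toy_contig = "ACGT" * 2500  # 10,000-bp stand-in for a public BAC contig
    fragments = shred(toy_contig)
    depth = sum(len(f) for f in fragments) / len(toy_contig)
    print(len(fragments), "fragments, mean depth", round(depth, 2))
```

Because the fragments carry no record of how the original contig was assembled, any misjoin in the source affects only the few fragments that span it, which is the bias-removal argument made in the text.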

To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section.

1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions

1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. Unless the DNA libraries are uniform in size, nonchimeric, and randomly representative of the genome, the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32).

Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ~130 ml of whole, heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females (one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians; see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from each end of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36).
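The link between read length and required throughput can be made concrete with a little arithmetic. The Python sketch below combines figures stated elsewhere in the paper (27,271,853 reads, sequencing from 8 September 1999 to 17 June 2000, a 543-bp mean trimmed read length, and a 2.9-Gb genome) with the 175,000 reads per day quoted above. The variable names are illustrative, and the averages are back-of-envelope estimates rather than figures reported by the authors.

```python
from datetime import date

TOTAL_READS = 27_271_853         # reads that entered the assembly
MEAN_TRIMMED_READ_BP = 543       # mean trimmed read length (bp)
GENOME_BP = 2.9e9                # genome size used for coverage figures
CURRENT_READS_PER_DAY = 175_000  # facility rate quoted in the text

sequencing_days = (date(2000, 6, 17) - date(1999, 9, 8)).days
average_reads_per_day = TOTAL_READS / sequencing_days
reads_per_fold = GENOME_BP / MEAN_TRIMMED_READ_BP
days_at_current_rate = TOTAL_READS / CURRENT_READS_PER_DAY

if __name__ == "__main__":
    print("days of sequencing:", sequencing_days)
    print("average reads per day:", round(average_reads_per_day))
    print("reads needed per 1-fold coverage:", round(reads_per_fold))
    print("days to repeat at the current rate:", round(days_at_current_rate, 1))
```

The average over the whole period comes out well below the quoted daily rate, which is what one would expect if the facility ramped up during the project. That interpretation is an inference, not a statement from the paper.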

The process for DNA sequencing was modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day’s interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation of all software and instrumentation before implementation, and production-scale testing of any process changes.
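The idea that backlogs between modules let each module run independently can be sketched as a chain of queues. The Python below is a toy model under stated assumptions: the `Pipeline` class, its methods, and the bar code strings are hypothetical and are not taken from Celera's LIMS. Each module draws plates from its own backlog at its own pace and passes them to the next backlog.

```python
from collections import deque
from dataclasses import dataclass, field

# The four production modules named in the text, in processing order.
MODULES = (
    "library transformation, plating, and colony picking",
    "DNA template preparation",
    "sequencing reaction set-up and purification",
    "sequence determination",
)


@dataclass
class Pipeline:
    # One backlog (queue of plate bar codes) in front of each module.
    backlogs: list = field(default_factory=lambda: [deque() for _ in MODULES])
    finished: list = field(default_factory=list)

    def receive(self, barcode: str) -> None:
        """Register a new plate in the backlog of the first module."""
        self.backlogs[0].append(barcode)

    def run_module(self, index: int, capacity: int) -> list:
        """Process up to `capacity` plates from one module's backlog."""
        backlog = self.backlogs[index]
        done = [backlog.popleft() for _ in range(min(capacity, len(backlog)))]
        if index + 1 < len(MODULES):
            self.backlogs[index + 1].extend(done)
        else:
            self.finished.extend(done)
        return done


if __name__ == "__main__":
    pipeline = Pipeline()
    for barcode in ("P001", "P002", "P003"):
        pipeline.receive(barcode)
    pipeline.run_module(0, capacity=2)   # module (i) handles two plates
    pipeline.run_module(1, capacity=5)   # module (ii) drains what is waiting
    print([list(b) for b in pipeline.backlogs], pipeline.finished)
```

Because a module only ever looks at its own backlog, a slow day upstream does not stall downstream work as long as the backlog is not empty.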

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file (37). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.
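The contaminant screen described above follows a simple rule: if any part of a read matches a contaminant, the entire read is discarded. The Python sketch below shows that rule with an exact shared k-mer standing in for a "significant match". That stand-in, the function names, and the value of k are assumptions for illustration. The paper does not describe its matching criterion here, and a real screen would use alignment and also check the reverse complement.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All substrings of length k in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads: dict[str, str], contaminants: dict[str, str], k: int = 20):
    """Split reads into kept reads and discarded reads.

    A read is discarded whole if it shares any exact k-mer with a contaminant
    sequence; the value stored for a discarded read is the contaminant name.
    """
    index: dict[str, str] = {}
    for name, seq in contaminants.items():
        for kmer in kmers(seq.upper(), k):
            index.setdefault(kmer, name)

    kept: dict[str, str] = {}
    discarded: dict[str, str] = {}
    for read_id, seq in reads.items():
        hit = next((index[km] for km in kmers(seq.upper(), k) if km in index), None)
        if hit is None:
            kept[read_id] = seq
        else:
            discarded[read_id] = hit
    return kept, discarded


if __name__ == "__main__":
    contaminants = {"toy_vector": "GATTACAGATTACACCGGTTAACC"}
    reads = {"r1": "TTTTGATTACACCGGAAAA", "r2": "CCCCCCCCCCCCCCCC"}
    print(screen_reads(reads, contaminants, k=8))
```

Discarding the whole read rather than trimming the matching portion is the conservative choice the text describes, since a partly contaminated read may also be chimeric.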

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increase. Each sequence read must be placed uniquely in the genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency.

Table 1. Celera-generated data input into assembly. The 2 kbp, 10 kbp, 50 kbp, and Total columns give the number of reads for the different insert libraries.

| Measure | Individual | 2 kbp | 10 kbp | 50 kbp | Total | Total number of base pairs |
|---|---|---|---|---|---|---|
| No. of sequencing reads | A | 0 | 0 | 2,767,357 | 2,767,357 | 1,502,674,851 |
| | B | 11,736,757 | 7,467,755 | 66,930 | 19,271,442 | 10,464,393,006 |
| | C | 853,819 | 881,290 | 0 | 1,735,109 | 942,164,187 |
| | D | 952,523 | 1,046,815 | 0 | 1,999,338 | 1,085,640,534 |
| | F | 0 | 1,498,607 | 0 | 1,498,607 | 813,743,601 |
| | Total | 13,543,099 | 10,894,467 | 2,834,287 | 27,271,853 | 14,808,616,179 |
| Fold sequence coverage (2.9-Gb genome) | A | 0 | 0 | 0.52 | 0.52 | |
| | B | 2.20 | 1.40 | 0.01 | 3.61 | |
| | C | 0.16 | 0.17 | 0 | 0.32 | |
| | D | 0.18 | 0.20 | 0 | 0.37 | |
| | F | 0 | 0.28 | 0 | 0.28 | |
| | Total | 2.54 | 2.04 | 0.53 | 5.11 | |
| Fold clone coverage | A | 0 | 0 | 18.39 | 18.39 | |
| | B | 2.96 | 11.26 | 0.44 | 14.67 | |
| | C | 0.22 | 1.33 | 0 | 1.54 | |
| | D | 0.24 | 1.58 | 0 | 1.82 | |
| | F | 0 | 2.26 | 0 | 2.26 | |
| | Total | 3.42 | 16.43 | 18.84 | 38.68 | |
| Insert size* (mean) | Average | 1,951 bp | 10,800 bp | 50,715 bp | | |
| Insert size* (SD) | Average | 6.10% | 8.10% | 14.90% | | |
| % Mates† | Average | 74.50 | 80.80 | 75.60 | | |

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.

2 Genome Assembly Strategy and Characterization
Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger.

Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ~25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed “chaff,” and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.
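The way a mate pair implies a distance between two contigs can be sketched as simple arithmetic. This is a minimal illustration and not the Celera implementation: the function names, the coordinate convention, and the example offsets are all assumptions. It supposes that the forward read starts at a known offset in the left contig, that its mate marks where the insert ends in the right contig, and that the clone has the mean insert length of its library.

```python
def estimate_gap(insert_mean, left_len, fwd_start, mate_end):
    """Estimate the gap between two contigs spanned by one mate pair.

    insert_mean: mean insert length of the library (bp)
    left_len:    length of the left contig (bp)
    fwd_start:   0-based start of the forward read in the left contig
    mate_end:    offset in the right contig at which the insert ends
    """
    covered_left = left_len - fwd_start   # insert bases inside the left contig
    covered_right = mate_end              # insert bases inside the right contig
    return insert_mean - covered_left - covered_right


def gap_from_bundle(estimates):
    """Average several estimates that refer to the same pair of contigs."""
    return sum(estimates) / len(estimates)


# Hypothetical offsets; 10,800 bp is the mean insert of the 10-kbp library (Table 1).
g1 = estimate_gap(10_800, left_len=30_000, fwd_start=26_000, mate_end=5_000)
g2 = estimate_gap(10_800, left_len=30_000, fwd_start=24_500, mate_end=3_700)
print(g1, g2, gap_from_bundle([g1, g2]))
```

Averaging several mate pairs narrows the estimate, because each individual insert deviates from the library mean by a few percent.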

2.1 Assembly data sets
We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.1× coverage of the genome, and clone coverage was 3.42×, 16.40×, and 18.84× for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7× clone coverage.
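The sequence-coverage figures quoted above follow directly from the read counts in Table 1. The short sketch below recomputes them; the variable and function names are illustrative, and the only inputs are the Table 1 totals and the 2.9-Gbp genome size assumed in the text.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text

# Read counts per insert library, from the "Total" row of Table 1.
READS = {"2 kbp": 13_543_099, "10 kbp": 10_894_467, "50 kbp": 2_834_287}
TOTAL_BP = 14_808_616_179  # total trimmed base pairs (Table 1)


def mean_read_length(total_bp, n_reads):
    """Average trimmed read length in bp."""
    return total_bp / n_reads


def fold_coverage(n_reads, read_len, genome_size=GENOME_SIZE_BP):
    """Sequence coverage = sequenced bases / genome size."""
    return n_reads * read_len / genome_size


total_reads = sum(READS.values())
read_len = mean_read_length(TOTAL_BP, total_reads)
per_library = {lib: round(fold_coverage(n, read_len), 2) for lib, n in READS.items()}
overall = round(fold_coverage(total_reads, read_len), 2)
print(total_reads, read_len, per_library, overall)
```

Clone coverage is computed analogously, with the insert length of each mated clone in place of the read length.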

The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1×. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3× to 4× light-shotgun of each BAC clone.

We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Sequences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).
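The tip-excision rule can be illustrated with a small sketch. It encodes one reading of the rule and is not the pipeline's code: the function name and the interval representation are hypothetical, "within 50 bp of the end" is taken to mean that the vector match begins no more than 50 bp from that end, and the matching vector is assumed to be removed together with the tip.

```python
def trim_vector_tips(contig_len, vector_hits, min_match=25, window=50):
    """Return the (start, end) interval of a contig kept after tip excision.

    vector_hits: list of half-open (start, end) intervals matching vector.
    A hit is acted on only if it is at least min_match bp long and lies
    within `window` bp of either end of the contig.
    """
    keep_start, keep_end = 0, contig_len
    for start, end in vector_hits:
        if end - start < min_match:
            continue                       # match too short to act on
        if start <= window:                # hit lies near the left end
            keep_start = max(keep_start, end)
        elif contig_len - end <= window:   # hit lies near the right end
            keep_end = min(keep_end, start)
    return keep_start, keep_end


# Hypothetical 1,000-bp contig with vector at both tips and a short internal hit.
print(trim_vector_tips(1000, [(10, 40), (970, 998), (500, 520)]))
```

Internal matches are left alone in this sketch; the text applies a separate 30-bp criterion to them during screening.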

2.2 Assembly strategies
Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow.

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.

For the whole-genome assembly, the PFP data was first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that form a perfect 2× covering of the bactigs. This resulted in 16.05 million “faux” reads that were sufficient to cover the genome 2.96× because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (8×), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC (41), possibly as a result of sample-tracking errors (see below). In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data.
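Shredding a bactig into faux reads can be sketched as a sliding window. The following is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, reads are assumed to start every half read length, and a final read is anchored at the sequence end, so the two tips of the bactig are covered only once.

```python
def shred(sequence, read_len=550, fold=2):
    """Cut a sequence into overlapping faux reads giving about fold-x tiling.

    Returns a list of (start, read) tuples. Reads start every
    read_len // fold bases; a final read is anchored at the sequence end.
    """
    if len(sequence) <= read_len:
        return [(0, sequence)]
    step = read_len // fold
    reads = []
    start = 0
    while start + read_len < len(sequence):
        reads.append((start, sequence[start:start + read_len]))
        start += step
    last = len(sequence) - read_len
    reads.append((last, sequence[last:]))
    return reads


def coverage(reads, length):
    """Per-base depth implied by a list of (start, read) tuples."""
    depth = [0] * length
    for start, read in reads:
        for i in range(start, start + len(read)):
            depth[i] += 1
    return depth


bactig = "ACGT" * 500  # hypothetical 2,000-bp bactig
faux_reads = shred(bactig)
depth = coverage(faux_reads, len(bactig))
print(len(faux_reads), min(depth), min(depth[275:-275]))
```

Because the faux reads carry no record of how the bactig was originally assembled, the assembler is free to place them independently.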

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or “components” that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera’s data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5× Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly
The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments.

Table 2. GenBank data input into assembly. Values are given by completion phase of the sequence.

| Center | Statistic | Phase 0 | Phase 1 and 2 | Phase 3 |
|---|---|---|---|---|
| Whitehead Institute/MIT Center for Genome Research, USA | Number of accession records | 2,825 | 6,533 | 363 |
| | Number of contigs | 243,786 | 138,023 | 363 |
| | Total base pairs | 194,490,158 | 1,083,848,245 | 48,829,358 |
| | Total vector masked (bp) | 1,553,597 | 875,618 | 2,202 |
| | Total contaminant masked (bp) | 13,654,482 | 4,417,055 | 98,028 |
| | Average contig length (bp) | 798 | 7,853 | 134,516 |
| Washington University, USA | Number of accession records | 19 | 3,232 | 1,300 |
| | Number of contigs | 2,127 | 61,812 | 1,300 |
| | Total base pairs | 1,195,732 | 561,171,788 | 164,214,395 |
| | Total vector masked (bp) | 21,604 | 270,942 | 8,287 |
| | Total contaminant masked (bp) | 22,469 | 1,476,141 | 469,487 |
| | Average contig length (bp) | 562 | 9,079 | 126,319 |
| Baylor College of Medicine, USA | Number of accession records | 0 | 1,626 | 363 |
| | Number of contigs | 0 | 44,861 | 363 |
| | Total base pairs | 0 | 265,547,066 | 49,017,104 |
| | Total vector masked (bp) | 0 | 218,769 | 4,960 |
| | Total contaminant masked (bp) | 0 | 1,784,700 | 485,137 |
| | Average contig length (bp) | 0 | 5,919 | 135,033 |
| Production Sequencing Facility, DOE Joint Genome Institute, USA | Number of accession records | 135 | 2,043 | 754 |
| | Number of contigs | 7,052 | 34,938 | 754 |
| | Total base pairs | 8,680,214 | 294,249,631 | 60,975,328 |
| | Total vector masked (bp) | 22,644 | 162,651 | 7,274 |
| | Total contaminant masked (bp) | 665,818 | 4,642,372 | 118,387 |
| | Average contig length (bp) | 1,231 | 8,422 | 80,867 |
| The Institute of Physical and Chemical Research (RIKEN), Japan | Number of accession records | 0 | 1,149 | 300 |
| | Number of contigs | 0 | 25,772 | 300 |
| | Total base pairs | 0 | 182,812,275 | 20,093,926 |
| | Total vector masked (bp) | 0 | 203,792 | 2,371 |
| | Total contaminant masked (bp) | 0 | 308,426 | 27,781 |
| | Average contig length (bp) | 0 | 7,093 | 66,978 |
| Sanger Centre, UK | Number of accession records | 0 | 4,538 | 2,599 |
| | Number of contigs | 0 | 74,324 | 2,599 |
| | Total base pairs | 0 | 689,059,692 | 246,118,000 |
| | Total vector masked (bp) | 0 | 427,326 | 25,054 |
| | Total contaminant masked (bp) | 0 | 2,066,305 | 374,561 |
| | Average contig length (bp) | 0 | 9,271 | 94,697 |
| Others* | Number of accession records | 42 | 1,894 | 3,458 |
| | Number of contigs | 5,978 | 29,898 | 3,458 |
| | Total base pairs | 5,564,879 | 283,358,877 | 246,474,157 |
| | Total vector masked (bp) | 57,448 | 279,477 | 32,136 |
| | Total contaminant masked (bp) | 575,366 | 1,616,665 | 1,791,849 |
| | Average contig length (bp) | 931 | 9,478 | 71,277 |
| All centers combined† | Number of accession records | 3,021 | 21,015 | 9,137 |
| | Number of contigs | 258,943 | 409,628 | 9,137 |
| | Total base pairs | 209,930,983 | 3,360,047,574 | 835,722,268 |
| | Total vector masked (bp) | 1,655,293 | 2,438,575 | 82,284 |
| | Total contaminant masked (bp) | 14,918,135 | 16,311,664 | 3,365,230 |
| | Average contig length (bp) | 811 | 8,203 | 91,466 |

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center; Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center; University of Washington. †The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.96× coverage of the genome.


The Overlapper compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data are scrupulously vector-trimmed, the Overlapper can insist on complete overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines operating in parallel.
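The overlap criterion itself is easy to state in code. The sketch below is a deliberately naive illustration and not the Overlapper: the function name is hypothetical, differences are counted as substitutions only (insertions and deletions are ignored), and every candidate length is tried in turn, which would be far too slow for millions of reads.

```python
def end_overlap(a, b, min_len=40, max_diff=0.06):
    """Longest suffix of `a` matching a prefix of `b` within max_diff mismatches.

    Returns (overlap_length, mismatches), or None if no overlap of at
    least min_len bp satisfies the difference threshold.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length, mismatches
    return None


# Toy reads sharing a 40-bp stretch (hypothetical sequences).
core = "ACGT" * 10
read_a = "T" * 10 + core
read_b = core + "G" * 10
print(end_overlap(read_a, read_b))
```

At the 40-bp minimum the 6% threshold admits two differences but not three, which is why accurate, fully trimmed reads matter for this step.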

Every overlap computed above is statistically a 1-in-10^17 event and thus not a coincidental event. What makes assembly combinatorially difficult is that while many overlaps are actually sampled from overlapping regions of the genome, and thus imply that the sequence reads should be assembled together, even more overlaps are actually from two distinct copies of a low-copy repeated element not screened above, thus constituting an error if put together. We call the former “true overlaps” and the latter “repeat-induced overlaps.” The assembler must avoid choosing repeat-induced overlaps, especially early in the process.

We achieve this objective in the Unitigger. We first find all assemblies of reads that appear to be uncontested with respect to all other reads. We call the contigs formed from these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very likely to be correctly assembled, of which we select those that will consistently scaffold (see below), and thus are again almost certain to be correct. We call the union of these two sets U-unitigs. Empirically, we found from a 6× simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are >2 kbp long. We are further able to identify the boundary of the start of a repetitive element at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all singly interspersed Alu elements and other 100- to 400-bp repetitive segments.
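A coverage-based log-odds score of the kind described can be sketched under a simple model. The text does not give the discriminator's formula, so what follows is an assumption: read starts are taken to arrive as a Poisson process, and a unique unitig is compared with a collapsed two-copy repeat. With $m$ the expected number of reads for unique DNA and $k$ the observed number,

$$\ln\frac{P(k \mid m)}{P(k \mid 2m)} = \ln\frac{e^{-m} m^{k}}{e^{-2m} (2m)^{k}} = m - k \ln 2 .$$

The function name and the example unitig are hypothetical; the read count and genome size are the Celera figures quoted earlier.

```python
import math


def unique_log_odds(unitig_len, n_reads, total_reads, genome_size):
    """Log-odds that a unitig is single-copy rather than a two-copy collapse.

    Assumes Poisson-distributed read starts (an assumption of this sketch).
    Positive values favour unique DNA; negative values favour a repeat.
    """
    expected = unitig_len * total_reads / genome_size  # reads expected if unique
    return expected - n_reads * math.log(2)


# A hypothetical 10-kbp unitig, scored at normal depth and at double depth.
normal = unique_log_odds(10_000, 94, 27_271_853, 2.9e9)
doubled = unique_log_odds(10_000, 188, 27_271_853, 2.9e9)
print(round(normal, 1), round(doubled, 1))
```

Longer unitigs give larger absolute scores, so a stringent threshold naturally selects long, confidently unique unitigs first.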

The result of running the Unitigger was thus a set of correctly assembled subcontigs covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair information to link these together into scaffolds. When there are two or more mate pairs that imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 10^10, assuming that mate pairs are false less than 2% of the time. Thus, one can with high confidence link together all U-unitigs that are linked by at least two 2- or 10-kbp mate pairs producing intermediate-sized scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This process yielded scaffolds that are on the order of megabase pairs in size with gaps between their contigs that generally correspond to repetitive elements and occasionally to small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a genome.
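The rule of requiring at least two confirming mate pairs before linking two U-unitigs can be sketched with a counter and a union-find structure. This is an illustrative simplification, not the Scaffolder: the names are hypothetical, and the sketch tracks only which unitigs end up grouped, ignoring the orientation and distance that the real links also carry.

```python
from collections import Counter


def scaffold_components(unitigs, mate_links, min_links=2):
    """Group unitigs that are joined by at least min_links mate pairs.

    mate_links: iterable of (unitig_a, unitig_b), one entry per mate pair
    whose two reads fall in different unitigs.
    """
    support = Counter(tuple(sorted(link)) for link in mate_links)
    parent = {u: u for u in unitigs}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (a, b), n in support.items():
        if n >= min_links:                 # singly supported links are ignored
            parent[find(a)] = find(b)

    groups = {}
    for u in unitigs:
        groups.setdefault(find(u), []).append(u)
    return sorted(sorted(g) for g in groups.values())


links = [("u1", "u2"), ("u2", "u1"), ("u2", "u3"), ("u3", "u4"), ("u4", "u3")]
print(scaffold_components(["u1", "u2", "u3", "u4"], links))
```

In the example the single mate pair between u2 and u3 is not enough to merge the two groups, which mirrors the confidence argument in the text.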

For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each stage was progressively more aggressive and thus more likely to make a mistake. For the human assembly, we continued to use the first “Rocks” substage where all unitigs with a good, but not definitive, discriminator score are placed in a scaffold gap. This was done with the condition that two or more mate pairs with one of their reads already in the scaffold unambiguously place the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect gap with this strategy to be less than 10^-7 based on a probabilistic analysis.

We revised the ensuing “Stones” substage of the human assembly, making it more like the mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the gap by virtue of its mated pair M being in a contig of the scaffold and implying R’s placement is collected. Celera’s mate-pairing information is correct more than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in the gap, and when a read does not belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set of reads within the gap, eliminating any reads that conflict with the assembly. This operation proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data set of human chromosome 22, all stones were placed correctly.

The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap “walking.” We did not include the very aggressive “Pebbles” substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size.

Fig. 4. Architecture of Celera’s two-pronged assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used.

At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with quality-value-weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used.

A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence-constructing subroutines. In addition, memory was a real issue: a straightforward application of the software we had built for Drosophila would have required a computer with a 600-gigabyte RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum of instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq’s ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of memory (Compaq’s GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours.

The assembly of Celera’s data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila. More than 84% of the genome was covered by scaffolds >100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, >62% of all gaps were less than 1 kbp long, and no gap was >100 kbp long. Similarly, more than 65% of the sequence is in contigs >30 kbp, more than 31% is in contigs >100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly.
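The averages quoted for scaffolds >100 kbp can be rederived from the counts in the whole-genome assembly section of Table 3. The sketch below does so; the function name and dictionary keys are illustrative, and the inputs are the Table 3 values for the >100 kbp column.

```python
# Whole-genome assembly, scaffolds >100 kbp (values from Table 3).
span_bp = 2_525_334_447    # bp in scaffolds, including intrascaffold gaps
contig_bp = 2_297_678_935  # bp in contigs
n_scaffolds = 1_637
n_contigs = 95_494
n_gaps = 93_857


def summarize(span_bp, contig_bp, n_scaffolds, n_contigs, n_gaps):
    """Derive the per-scaffold, per-contig, and per-gap averages."""
    return {
        "avg_scaffold_bp": round(span_bp / n_scaffolds),
        "avg_contig_bp": round(contig_bp / n_contigs),
        "avg_gap_bp": round((span_bp - contig_bp) / n_gaps),
        "pct_sequence": round(100 * contig_bp / span_bp),
        # Each scaffold of c contigs has c - 1 internal gaps.
        "gaps_expected": n_contigs - n_scaffolds,
    }


stats = summarize(span_bp, contig_bp, n_scaffolds, n_contigs, n_gaps)
print(stats)
```

The derived values agree with the 1.5-Mbp, 24.06-kbp, 2.43-kbp, and 91% figures in the text, and the gap count equals the number of contigs minus the number of scaffolds.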

2.4 Compartmentalized shotgun assembly
In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. Columns give the statistic for all scaffolds and for scaffolds above the indicated size.

| Statistic | All | >30 kbp | >100 kbp | >500 kbp | >1000 kbp |
|---|---|---|---|---|---|
| Compartmentalized shotgun assembly | | | | | |
| No. of bp in scaffolds (including intrascaffold gaps) | 2,905,568,203 | 2,748,892,430 | 2,700,489,906 | 2,489,357,260 | 2,248,689,128 |
| No. of bp in contigs | 2,653,979,733 | 2,524,251,302 | 2,491,538,372 | 2,320,648,201 | 2,106,521,902 |
| No. of scaffolds | 53,591 | 2,845 | 1,935 | 1,060 | 721 |
| No. of contigs | 170,033 | 112,207 | 107,199 | 93,138 | 82,009 |
| No. of gaps | 116,442 | 109,362 | 105,264 | 92,078 | 81,288 |
| No. of gaps ≤1 kbp | 72,091 | 69,175 | 67,289 | 59,915 | 53,354 |
| Average scaffold size (bp) | 54,217 | 966,219 | 1,395,602 | 2,348,450 | 3,118,848 |
| Average contig size (bp) | 15,609 | 22,496 | 23,242 | 24,916 | 25,686 |
| Average intrascaffold gap size (bp) | 2,161 | 2,054 | 1,985 | 1,832 | 1,749 |
| Largest contig (bp) | 1,988,321 | 1,988,321 | 1,988,321 | 1,988,321 | 1,988,321 |
| % of total contigs | 100 | 95 | 94 | 87 | 79 |
| Whole-genome assembly | | | | | |
| No. of bp in scaffolds (including intrascaffold gaps) | 2,847,890,390 | 2,574,792,618 | 2,525,334,447 | 2,328,535,466 | 2,140,943,032 |
| No. of bp in contigs | 2,586,634,108 | 2,334,343,339 | 2,297,678,935 | 2,143,002,184 | 1,983,305,432 |
| No. of scaffolds | 118,968 | 2,507 | 1,637 | 818 | 554 |
| No. of contigs | 221,036 | 99,189 | 95,494 | 84,641 | 76,285 |
| No. of gaps | 102,068 | 96,682 | 93,857 | 83,823 | 75,731 |
| No. of gaps ≤1 kbp | 62,356 | 60,343 | 59,156 | 54,079 | 49,592 |
| Average scaffold size (bp) | 23,938 | 1,027,041 | 1,542,660 | 2,846,620 | 3,864,518 |
| Average contig size (bp) | 11,702 | 23,534 | 24,061 | 25,319 | 25,999 |
| Average intrascaffold gap size (bp) | 2,560 | 2,487 | 2,426 | 2,213 | 2,082 |
| Largest contig (bp) | 1,224,073 | 1,224,073 | 1,224,073 | 1,224,073 | 1,224,073 |
| % of total contigs | 100 | 90 | 89 | 83 | 77 |

The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to properly place a Celera read, so all reads were first masked against a library of common repetitive elements, and only matches of at least 40 bp to unmasked portions of the read constituted a hit. Of Celera’s 27.27 million reads, 20.76 million matched a bactig and another 0.62 million reads, which did not have any matches, were nonetheless identified as belonging in the region of the bactig’s BAC because their mate matched the bactig. Of the remaining reads, 2.92 million were completely screened out and so could not be matched, but the other 2.97 million reads had unmasked sequence totaling 1.189 Gbp that were not found in the GenBank data set. Because the Celera data are 5.11× redundant, we estimate that 240 Mbp of unique Celera sequence is not in the GenBank data set.

In the next step of the CSA process, a combining assembler took the relevant 5× Celera reads and bactigs for a BAC entry, and produced an assembly of the combined data for that locale. These high-quality sequence reconstructions were a transient result whose utility was simply to provide more reliable information for the purposes of their tiling into sets of overlapping and adjacent scaffold sequences in the next step. In outline, the combining assembler first examines the set of matching Celera reads to determine if there are excessive pileups indicative of unscreened repetitive elements. Wherever these occur, reads in the repeat region whose mates have not been mapped to consistent positions are removed. Then all sets of mate pairs that consistently imply the same relative position of two bactigs are bundled into a link and weighted according to the number of mates in the bundle. A “greedy” strategy then attempts to order the bactigs by selecting bundles of mate-pairs in order of their weight. A selected mate-pair bundle can tie together two formative scaffolds. It is incorporated to form a single scaffold only if it is consistent with the majority of links between contigs of the scaffold. Once scaffolding is complete, gaps are filled by the “Stones” strategy described above for the WGA assembler.
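The greedy, weight-ordered selection of mate-pair bundles can be sketched as below. This is a simplified stand-in and not the combining assembler: the function name is hypothetical, and the majority-consistency test described in the text is replaced by two cruder rules, namely that a bactig may have at most two neighbours and that a bundle may not close a cycle.

```python
from collections import Counter


def greedy_order(mate_links):
    """Chain bactigs by accepting mate-pair bundles, heaviest first.

    mate_links: iterable of (bactig_a, bactig_b), one entry per mate pair.
    Returns the accepted bundles as (bactig_a, bactig_b, weight) tuples.
    """
    bundles = Counter(tuple(sorted(link)) for link in mate_links)
    neighbours = {}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    accepted = []
    # Heaviest bundle first; ties are broken by name so the result is deterministic.
    for (a, b), weight in sorted(bundles.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(neighbours.get(a, [])) >= 2 or len(neighbours.get(b, [])) >= 2:
            continue                       # a bactig has only two ends
        if find(a) == find(b):
            continue                       # would close a cycle
        parent[find(a)] = find(b)
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
        accepted.append((a, b, weight))
    return accepted


links = [("b1", "b2")] * 3 + [("b3", "b2")] * 2 + [("b1", "b3"), ("b2", "b4")]
print(greedy_order(links))
```

In the example the two singly supported bundles are rejected because they conflict with the heavier bundles accepted earlier, which is the behaviour a weight-ordered greedy choice is meant to produce.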

The GenBank data for the Phase 1 and 2 BACs consisted of an average of 19.8 bactigs per BAC of average size 8099 bp. Application of the combining assembler resulted in individual Celera BAC assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold) consisting of an average of 8.57 contigs of average size 18,973 bp. In addition to defining order and orientation of the sequence fragments, there were 57% fewer gaps in the combined result. For Phase 0 data, the average GenBank entry consisted of 91.52 reads of average length 784 bp. Application of the combining assembler resulted in an average of 54.8 scaffolds consisting of an average of 58.1 contigs of average size 873 bp. Basically, some small amount of assembly took place, but not enough Celera data were matched to truly assemble the 0.5× to 1× data set represented by the typical Phase 0 BACs. The combining assembler was also applied to the Phase 3 BACs for SNP identification, confirmation of assembly, and localization of the Celera reads. The Phase 0 data suggest that a combined whole-genome shotgun data set and 1× light-shotgun of BACs will not yield good assembly of BAC regions; at least 3× light-shotgun of each BAC is needed.

The 5.89 million Celera fragments not matching the GenBank data were assembled with our whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence. All scaffolds >5 kbp were forwarded along with all scaffolds produced by the combining assembler to the subsequent tiling phase.

At this stage, we typically had one or two scaffolds for every BAC region constituting at least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds. The next step in developing the genome components was to determine the order and overlap tiling of these BAC and Celera-unique scaffolds across the genome. For this, we used Celera’s 50-kbp mate-pairs information, and BAC-end pairs (18) and sequence tagged site (STS) markers (44) to provide long-range guidance and chromosome separation. Given the relatively manageable number of scaffolds, we chose not to produce this tiling in a fully automated manner, but to compute an initial tiling with a good heuristic and then use human curators to resolve discrepancies or missed join opportunities. To this end, we developed a graphical user interface that displayed the graph of tiling overlaps and the evidence for each. A human curator could then explore the implication of mapped STS data, dot-plots of sequence overlap, and a visual display of the mate-pair evidence supporting a given choice. The result of this process was a collection of “components,” where each component was a tiled set of BAC and Celera-unique scaffolds that had been curator-approved. The process resulted in 3845 components with an estimated span of 2.922 Gbp.

In order to generate the final CSA, we assembled each component with the WGA algorithm. As was done in the WGA process, the bactig data were shredded into a synthetic 2× shotgun data set in order to give the assembler the freedom to independently assemble the data. By using faux reads rather than bactigs, the assembly algorithm could correct errors in the assembly of bactigs and remove chimeric content in a PFP data entry.

Chimeric or contaminating sequence (from another part of the genome) would not be incorporated into the reassembly of the component because it did not belong there. In effect, the previous steps in the CSA process served only to bring together Celera fragments and PFP data relevant to a large contiguous segment of the genome, wherein we applied the assembler used for WGA to produce an ab initio assembly of the region.
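The shredding of bactigs into faux reads can be illustrated with a minimal sketch. The text states only that a synthetic 2× shotgun data set was produced; the 550-bp read length, the even spacing of read starts, and the function name `shred` are assumptions made for illustration.

```python
def shred(sequence, read_len=550, coverage=2):
    """Tile `sequence` with overlapping faux reads at roughly `coverage`-fold depth.

    read_len and the even spacing are illustrative assumptions, not the
    parameters used for the CSA.
    """
    if len(sequence) <= read_len:
        return [(0, sequence)]
    step = read_len // coverage              # new read every read_len/coverage bases
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)            # make sure the final bases are covered
    return [(s, sequence[s:s + read_len]) for s in starts]


bactig = "ACGT" * 2500                       # toy 10-kbp bactig
reads = shred(bactig)
depth = sum(len(r) for _, r in reads) / len(bactig)
```

Because each faux read carries no memory of the bactig it came from, an assembler given these reads is free to join them differently from the original bactig, which is the property the text relies on.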

WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span and consisting of 2.654 Gbp of sequence. The chaff, or set of reads not incorporated into the assembly, numbered 6.17 million, or 22%. More than 90.0% of the genome was covered by scaffolds spanning >100 kbp long, and these averaged 92.2% sequence and 7.8% gaps with a total of 2.492 Gbp of sequence. There were a total of 105,264 gaps among the 107,199 contigs that belong to the 1940 scaffolds spanning >100 kbp. The average scaffold size was 1.4 Mbp, the average contig size was 23.24 kbp, and the average gap size was 2.0 kbp where each distribution of sizes was exponential. As such, averages tend to be underrepresentative of the majority of the data. Figure 5 shows a histogram of the bases in scaffolds of various size ranges. Consider also that more than 49% of all gaps were <500 bp long, more than 62% of all gaps were <1 kbp, and all gaps are <100 kbp long. Similarly, more than 73% of the sequence is in contigs >30 kbp, more than 49% is in contigs >100 kbp, and the largest contig was 1.99 Mbp long. Table 3 provides summary statistics for the structure of this assembly with a direct comparison to the WGA assembly.
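The remark that averages describe skewed size distributions poorly can be made concrete with a small sketch. The contig lengths below are invented toy values, not data from the assembly; the point is only that most pieces can be shorter than the mean while most bases sit in the few long pieces.

```python
def mean_length(lengths):
    """Arithmetic mean of a list of piece lengths."""
    return sum(lengths) / len(lengths)


def fraction_of_bases_in_long(lengths, threshold):
    """Fraction of all bases that lie in pieces longer than `threshold`."""
    return sum(n for n in lengths if n > threshold) / sum(lengths)


# Hypothetical contig lengths in kbp (toy data, heavily skewed).
contig_kbp = [1, 1, 2, 2, 3, 5, 8, 20, 60, 130]

mean_kbp = mean_length(contig_kbp)
shorter_than_mean = sum(n < mean_kbp for n in contig_kbp)
in_long = fraction_of_bases_in_long(contig_kbp, 30)
```

This is why the text reports the share of sequence in contigs above a size threshold alongside the averages.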

2.5 Comparison of the WGA and CSA scaffolds

Having obtained two assemblies of the human genome via independent computational processes (WGA and CSA), we compared scaffolds from the two assemblies as another means of investigating their completeness, consistency, and contiguity. From each assembly, a set of reference scaffolds containing at least 1000 fragments (Celera sequencing reads or bactig shreds) was obtained; this amounted to 2218 WGA scaffolds and 1717 CSA scaffolds, for a total of 2.087 Gbp and 2.474 Gbp. The sequence of each reference scaffold was compared to the sequence of all scaffolds from the other assembly with which it shared at least 20 fragments or at least 20% of the fragments of the smaller scaffold. For each such comparison, all matches of at least 200 bp with at most 2% mismatch were tabulated.
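The two thresholds in this comparison (which scaffold pairs to compare, and which matches to tabulate) can be written as simple predicates. This is a sketch under the assumption that each scaffold is represented by the set of fragment identifiers it contains; the function names and fragment labels are hypothetical.

```python
def should_compare(frags_a, frags_b, min_shared=20, min_fraction=0.20):
    """True if two scaffolds share >= 20 fragments, or >= 20% of the
    fragments of the smaller scaffold."""
    shared = len(frags_a & frags_b)
    smaller = min(len(frags_a), len(frags_b))
    return shared >= min_shared or (smaller > 0 and shared / smaller >= min_fraction)


def keep_match(length_bp, mismatches, min_len=200, max_mismatch=0.02):
    """True for matches of at least 200 bp with at most 2% mismatch."""
    return length_bp >= min_len and mismatches / length_bp <= max_mismatch


# Hypothetical fragment sets: a large reference scaffold and a small one.
reference = {f"frag{i}" for i in range(1000)}
small = {f"frag{i}" for i in range(995, 1005)}   # 10 fragments, 5 shared

compare_small = should_compare(reference, small)
```

The fractional criterion matters for small scaffolds, which could never reach 20 shared fragments but may still lie entirely within a reference scaffold.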

From this tabulation, we estimated the amount of unique sequence in each assembly in two ways. The first was to determine the number of bases of each assembly that were not covered by a matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%) was not covered by the CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the WGA. This estimate did not require any consistency of the assemblies or any uniqueness of the matching segments. Thus, another analysis was conducted in which matches of less than 1 kbp between a pair of scaffolds were excluded unless they were confirmed by other matches having a consistent order and orientation. This gives some measure of consistent coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and 2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more stringent measure.
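Counting the bases not covered by any matching segment amounts to taking the union of the match intervals on a scaffold and subtracting its size from the scaffold length. The sketch below assumes half-open `[start, end)` coordinates and uses invented match positions; it is an illustration of the first, less stringent measure only.

```python
def covered_bases(intervals):
    """Number of bases covered by the union of half-open [start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start   # close the previous merged run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)        # overlapping or touching: extend the run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy example: a 10-kbp scaffold with three matches, two of them overlapping.
scaffold_len = 10_000
matches = [(0, 3_000), (2_500, 6_000), (8_000, 9_000)]
uncovered = scaffold_len - covered_bases(matches)
uncovered_pct = 100 * uncovered / scaffold_len
```

Overlapping matches are counted once, which is why this estimate needs neither uniqueness nor consistent order of the matching segments.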

The comparison of WGA to CSA also permitted evaluation of scaffolds for structural inconsistencies. We looked for instances in which a large section of a scaffold from one assembly matched only one scaffold from the other assembly, but failed to match over the full length of the overlap implied by the matching segments. An initial set of candidates was identified automatically, and then each candidate was inspected by hand. From this process, we identified 31 instances in which the assemblies appear to disagree in a nonlocal fashion. These cases are being further evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The following results exclude cases in which one contig in one assembly corresponds to more than one overlapping contig in the other assembly (as long as the order and orientation of the latter agrees with the positions they match in the former). Most of these small rearrangements involved segments on the order of hundreds of base pairs and rarely >1 kbp. We found a total of 295 kbp (0.012%) in the CSA assemblies that were locally inconsistent with the WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were inconsistent with the CSA assembly.

The CSA assembly was a few percentage points better in terms of coverage and slightly more consistent than the WGA, because it was in effect performing a few thousand shotgun assemblies of megabase-sized problems, whereas the WGA is performing a shotgun assembly of a gigabase-sized problem. When one considers the increase of two-and-a-half orders of magnitude in problem size, the information loss between the two is remarkably small. Because CSA was logistically easier to deliver and the better of the two results available at the time when downstream analyses needed to be begun, all subsequent analysis was performed on this assembly.

2.6 Mapping scaffolds to the genome

The final step in assembling the genome was to order and orient the scaffolds on the chromosomes. We first grouped scaffolds together on the basis of their order in the components from CSA. These grouped scaffolds were reordered by examining residual mate-pairing data between the scaffolds. We next mapped the scaffold groups onto the chromosome using physical mapping data. This step depends on having reliable high-resolution map information such that each scaffold will overlap multiple markers. There are two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC clones developed at Washington University (45). Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most useful for mapping scaffolds. The two different mapping approaches are complementary to one another. The fingerprint maps should have better local order because they were built by comparison of overlapping BAC clones. On the other hand, GM99 should have a more reliable long-range order, because the framework markers were derived from well-validated genetic maps. Both types of maps were used as a reference for human curation of the components that were the input to the regional assembly, but they did not determine the order of sequences produced by the assembler.

In order to determine the effectiveness of the fingerprint maps and GM99 for mapping scaffolds, we first examined the reliability of these maps by comparison with large scaffolds. Only 1% of the STS markers on the 10 largest scaffolds (those >9 Mbp) were mapped on a different chromosome on GM99. Two percent of the STS markers disagreed in position by more than five framework bins. However, for the fingerprint maps, a 2% chromosome discrepancy was observed, and on average 23.8% of BAC locations in the scaffold sequence disagreed with fingerprint map placement by more than five BACs. When further examining the source of discrepancy, it was found that most of the discrepancy came from 4 of the 10 scaffolds, indicating that there is variation in the quality of either the map or the scaffolds. All four scaffolds were assembled as well as the other six, as judged by clone coverage analysis, and showed the same low discrepancy rate to GM99, and thus we concluded that the fingerprint map global order in these cases was not reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of STSs were discordant by more than five framework bins), but a lower discordance rate with the fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs). This observation agrees with the clone coverage analysis (46) that Celera scaffold construction was better supported by long-range mate pairs in larger scaffolds than in small scaffolds.

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total sequence is indicated.

We created two orderings of Celera scaffolds on the basis of the markers (BAC or STS) on these maps. Where the order of scaffolds agreed between GM99 and the WashU BAC map, we had a high degree of confidence that that order was correct; these scaffolds were termed “anchor scaffolds.” Only scaffolds with a low overall discrepancy rate with both maps were considered anchor scaffolds. Scaffolds in GM99 bins were allowed to permute in their order to match WashU ordering, provided they did not violate their framework orders. Orientation of individual scaffolds was determined by the presence of multiple mapped markers with consistent order. Scaffolds with only one marker have insufficient information to assign orientation. We found 70.1% of the genome in anchored scaffolds, more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution than the WashU map, a number of scaffolds without STS matches could be ordered relative to the anchored scaffolds because they included sequence from the same or adjacent BACs on the WashU map. On the other hand, because of occasional WashU global ordering discrepancies, a number of scaffolds determined to be “unmappable” on the WashU map could be ordered relative to the anchored scaffolds with GM99. These scaffolds were termed “ordered scaffolds.” We found that 13.9% of the assembly could be ordered by these additional methods, and thus 84.0% of the genome was ordered unambiguously.
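The orientation rule (several mapped markers in a consistent order are needed; a single marker is not enough) can be sketched as follows. The representation of a marker as a `(position in scaffold, position on map)` pair and the function name `orient` are assumptions for illustration, not a description of the actual mapping software.

```python
def orient(markers):
    """Infer scaffold orientation from mapped markers.

    markers: list of (scaffold_pos, map_pos) pairs for one scaffold.
    Returns '+' if map positions increase along the scaffold, '-' if they
    decrease, and None if the scaffold cannot be oriented.
    """
    if len(markers) < 2:
        return None                          # one marker cannot give a direction
    map_positions = [m for _, m in sorted(markers)]   # walk along the scaffold
    steps = [b - a for a, b in zip(map_positions, map_positions[1:]) if b != a]
    if steps and all(s > 0 for s in steps):
        return "+"
    if steps and all(s < 0 for s in steps):
        return "-"
    return None                              # conflicting or uninformative order


forward = orient([(100, 5), (900, 9)])
reverse = orient([(100, 9), (900, 5), (1500, 2)])
conflict = orient([(100, 5), (900, 9), (1500, 7)])
single = orient([(100, 5)])
```

Markers that fall in the same map bin contribute no step, so a scaffold whose markers all share one bin stays unoriented, as in the text's treatment of low-resolution map data.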

Next, all scaffolds that could be placed, but not ordered, between anchors were assigned to the interval between the anchored scaffolds and were deemed to be “bounded” between them. For example, small scaffolds having STS hits from the same GeneMap bin or hitting the same BAC cannot be ordered relative to each other, but can be assigned a placement boundary relative to other anchored or ordered scaffolds. The remaining scaffolds either had no localization information, conflicting information, or could only be assigned to a generic chromosome location. Using the above approaches, ~98% of the genome was anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on the chromosome by spreading out the scaffolds per chromosome. We assumed that the remaining unmapped scaffolds, constituting 2% of the genome, were distributed evenly across the genome. By dividing the sum of unmapped scaffold lengths with the sum of the number of mapped scaffolds, we arrived at an estimate of interscaffold gap of 1483 bp. This gap was used to separate all the scaffolds on each chromosome and to assign an offset in the chromosome.
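The spreading step can be sketched in a few lines: estimate a uniform gap as unmapped length divided by the number of mapped scaffolds, then lay scaffolds end to end with that gap between them. The scaffold lengths and totals below are toy values, and the function names are hypothetical; only the 1483-bp default comes from the text.

```python
def mean_interscaffold_gap(unmapped_total_bp, n_mapped_scaffolds):
    """Uniform gap estimate: unmapped sequence spread evenly over mapped scaffolds."""
    return unmapped_total_bp // n_mapped_scaffolds


def assign_offsets(scaffold_lengths, gap=1483):
    """Chromosome offset of each scaffold when consecutive scaffolds are
    separated by a fixed gap (1483 bp is the estimate quoted in the text)."""
    offsets, pos = [], 0
    for length in scaffold_lengths:
        offsets.append(pos)
        pos += length + gap
    return offsets


toy_gap = mean_interscaffold_gap(3_000_000, 2_000)       # toy totals
toy_offsets = assign_offsets([1_000, 2_000, 500])        # toy scaffold lengths
```

The resulting coordinates are a convention for display and lookup rather than a measurement, since the true sizes of interscaffold gaps are unknown.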

During the scaffold-mapping effort, we encountered many problems that resulted in additional quality assessment and validation analysis. At least 978 (3% of 33,173) BACs were believed to have sequence data from more than one location in the genome (47). This is consistent with the bactig chimerism analysis reported above in the Assembly Strategies section. These BACs could not be assigned to unique positions within the CSA assembly and thus could not be used for ordering scaffolds. Likewise, it was not always possible to assign STSs to unique locations in the assembly because of genome duplications, repetitive elements, and pseudogenes.

Because of the time required for an exhaustive search for a perfect overlap, CSA generated 21,607 intrascaffold gaps where the mate-pair data suggested that the contigs should overlap, but no overlap was found. These gaps were defined as a fixed 50 bp in length and make up 18.6% of the total 116,442 gaps in the CSA assembly.

We chose not to use the order of exons implied in cDNA or EST data as a way of ordering scaffolds. The rationale for not using this data was that doing so would have biased certain regions of the assembly by rearranging scaffolds to fit the transcript data and made validation of both the assembly and gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome from the perspectives of completeness (amount of coverage of the genome) and correctness (the structural accuracy of the order and orientation and the consensus sequence of the assembly).

Completeness. Completeness is defined as the percentage of the euchromatic sequence represented in the assembly. This cannot be known with absolute certainty until the euchromatin sequence has been completed. However, it is possible to estimate completeness on the basis of (i) the estimated sizes of intrascaffold gaps; (ii) coverage of the two published chromosomes, 21 and 22 (48, 49); and (iii) analysis of the percentage of an independent set of random sequences (STS markers) contained in the assembly. The whole-genome libraries contain heterochromatic sequence and, although no attempt has been made to assemble it, there may be instances of unique sequence embedded in regions of heterochromatin as were observed in Drosophila (50, 51).

The sequences of human chromosomes 21 and 22 have been completed to high quality and published (48, 49). Although this sequence served as input to the assembler, the finished sequence was shredded into a shotgun data set so that the assembler had the opportunity to assemble it differently from the original sequence in the case of structural polymorphisms or assembly errors in the BAC data. In particular, the assembler must be able to resolve repetitive elements at the scale of components (generally multimegabase in size), and so this comparison reveals the level to which the assembler resolves repeats. In certain areas, the assembly structure differs from the published versions of chromosomes 21 and 22 (see below). The consequence of the flexibility to assemble “finished” sequence differently on the basis of Celera data resulted in an assembly with more segments than the chromosome 21 and 22 sequences. We examined the reasons why there are more gaps in the Celera sequence than in chromosomes 21 and 22 and expect that they may be typical of gaps in other regions of the genome. In the Celera assembly, there are 25 scaffolds, each containing at least 10 kb of sequence, that collectively span 94.3% of chromosome 21. Sixty-two scaffolds span 95.7% of chromosome 22. The total length of the gaps remaining in the Celera assembly for these two chromosomes is 3.4 Mbp. These gap sequences were analyzed by RepeatMasker and by searching against the entire genome assembly (52). About 50% of the gap sequence consisted of common repetitive elements identified by RepeatMasker; more than half of the remainder was lower copy number repeat elements.

A more global way of assessing completeness is to measure the content of an independent set of sequence data in the assembly. We compared 48,938 STS markers from GeneMap99 (51) to the scaffolds. Because these markers were not used in the assembly processes, they provided a truly independent measure of completeness. ePCR (53) and BLAST (54) were used to locate STSs on the assembled genome. We found 44,524 (91%) of the STSs in the mapped genome. An additional 2648 markers (5.4%) were found by searching the unassembled data or “chaff.” We identified 1283 STS markers (2.6%) not found in either Celera sequence or BAC data as of September 2000, raising the possibility that these markers may not be of human origin. If that were the case, the Celera assembled sequence would represent 93.4% of the human genome and the unassembled data 5.5%, for a total of 98.9% coverage. Similarly, we compared CSA against 36,678 TNG radiation hybrid markers (55a) using the same method. We found that 32,371 markers (88%) were located in the mapped CSA scaffolds, with 2055 markers (5.6%) found in the remainder. This gave a 94% coverage of the genome through another genome-wide survey.
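The completeness percentages follow directly from the marker counts given in the text, and the arithmetic can be checked with a short script. The variable names are illustrative; all counts are taken from the paragraph above.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# GeneMap99 STS markers (counts from the text).
sts_total, sts_mapped, sts_chaff, sts_missing = 48_938, 44_524, 2_648, 1_283

mapped_pct = pct(sts_mapped, sts_total)      # share found in the mapped genome
chaff_pct = pct(sts_chaff, sts_total)        # share found only in the chaff
missing_pct = pct(sts_missing, sts_total)    # share found nowhere

# If the missing markers are not of human origin, drop them from the denominator.
human_total = sts_total - sts_missing
assembled_cov = pct(sts_mapped, human_total)
chaff_cov = pct(sts_chaff, human_total)

# TNG radiation hybrid markers (counts from the text).
tng_total, tng_mapped, tng_rest = 36_678, 32_371, 2_055
tng_cov = pct(tng_mapped + tng_rest, tng_total)
```

The computed values agree with the reported figures to within the last digit; the reported 5.5% and 98.9% correspond to roughly 5.56% and 98.99%, so those two appear to have been truncated rather than rounded.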

Correctness. Correctness is defined as the structural and sequence accuracy of the assembly. Because the source sequences for the Celera data and the GenBank data are from different individuals, we could not directly compare the consensus sequence of the assembly against other finished sequence for determining sequencing accuracy at the nucleotide level, although this has been done for identifying polymorphisms as described in Section 6. The accuracy of the consensus sequence is at least 99.96% on the basis of a statistical estimate derived from the quality values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds were mapped to the genome with different levels of confidence (anchored scaffolds have the highest confidence; unmapped scaffolds have the lowest). Anchored scaffolds were consistently ordered by the WashU BAC map and GM99. Ordered scaffolds were consistently ordered by at least one of the following: the WashU BAC map, GM99, or component tiling path. Bounded scaffolds had order conflicts between at least two of the external maps, but their placements were adjacent to a neighboring anchored or ordered scaffold. Unmapped scaffolds had, at most, a chromosome assignment. The scaffold subcategories are given below each category.

| Mapped scaffold category | Number | Length (bp) | % Total length |
|---|---|---|---|
| Anchored | 1,526 | 1,860,676,676 | 70 |
| Anchored: oriented | 1,246 | 1,852,088,645 | 70 |
| Anchored: unoriented | 280 | 8,588,031 | 0.3 |
| Ordered | 2,001 | 369,235,857 | 14 |
| Ordered: oriented | 839 | 329,633,166 | 12 |
| Ordered: unoriented | 1,162 | 39,602,691 | 2 |
| Bounded | 38,241 | 368,753,463 | 14 |
| Bounded: oriented | 7,453 | 274,536,424 | 10 |
| Bounded: unoriented | 30,788 | 94,217,039 | 4 |
| Unmapped | 11,823 | 55,313,737 | 2 |
| Unmapped: known chromosome | 281 | 2,505,844 | 0.1 |
| Unmapped: unknown chromosome | 11,542 | 52,807,893 | 2 |

The structural consistency of the assembly can be measured by mate-pair analysis. In a correct assembly, every mated pair of sequencing reads should be located on the consensus sequence with the correct separation and orientation between the pairs. A pair is termed “valid” when the reads are in the correct orientation and the distance between them is within the mean ± 3 standard deviations of the distribution of insert sizes of the library from which the pair was sampled. A pair is termed “misoriented” when the reads are not correctly oriented, and is termed “misseparated” when the distance between the reads is not in the correct range but the reads are correctly oriented. The mean ± the standard deviation of each library used by the assembler was determined as described above. To validate these, we examined all reads mapped to the finished sequence of chromosome 21 (48) and determined how many incorrect mate pairs there were as a result of laboratory tracking errors and chimerism (two different segments of the genome cloned into the same plasmid), and how tight the distribution of insert sizes was for those that were correct (Table 5). The standard deviations for all Celera libraries were quite small, less than 15% of the insert length, with the exception of a few 50-kbp libraries. The 2- and 10-kbp libraries contained less than 2% invalid mate pairs, whereas the 50-kbp libraries were somewhat higher (~10%). Thus, although the mate-pair information was not perfect, its accuracy was such that measuring valid, misoriented, and misseparated pairs with respect to a given assembly was deemed to be a reliable instrument for validation purposes, especially when several mate pairs confirm or deny an ordering.
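The three mate-pair categories defined above translate into a short classification rule. The sketch below is illustrative only; the function name and argument layout are assumptions, and the example uses the chromosome 21 mean and standard deviation reported for library 1 in Table 5 (2,081 bp and 106 bp).

```python
def classify_pair(correctly_oriented, distance, mean, sd, k=3):
    """Classify a mate pair as 'valid', 'misoriented', or 'misseparated'.

    A pair is valid when the reads are correctly oriented and their
    separation lies within mean +/- k standard deviations (k = 3 in the text).
    """
    if not correctly_oriented:
        return "misoriented"
    if abs(distance - mean) > k * sd:
        return "misseparated"
    return "valid"


# Library 1 (2 kbp): the valid window is 2081 +/- 318 bp, i.e. 1763 to 2399 bp.
LIB1_MEAN, LIB1_SD = 2081, 106

ok = classify_pair(True, 2300, LIB1_MEAN, LIB1_SD)
too_far = classify_pair(True, 2500, LIB1_MEAN, LIB1_SD)
flipped = classify_pair(False, 2081, LIB1_MEAN, LIB1_SD)
```

Orientation is tested first, so a pair that is both wrongly oriented and wrongly spaced counts as misoriented, matching the definition of misseparated pairs as correctly oriented.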

The clone coverage of the genome was 39×, meaning that any given base pair was, on average, contained in 39 clones or, equivalently, spanned by 39 mate-paired reads. Areas of low clone coverage or areas with a high proportion of invalid mate pairs would indicate potential assembly problems. We computed the coverage of each base in the assembly by valid mate pairs (Table 6). In summary, for scaffolds >30 kbp in length, less than 1% of the Celera assembly was in regions of less than 3× clone coverage. Thus, more than 99% of the assembly, including order and orientation, is strongly supported by this measure alone.
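Per-base clone coverage can be computed from the spans of valid mate pairs with a difference array, which avoids touching every base of every clone. This is a minimal sketch assuming half-open `[start, end)` spans measured from the outer ends of the two reads; the spans and scaffold length are toy values.

```python
def clone_coverage(length, spans):
    """Per-base count of valid mate-pair spans covering each position."""
    diff = [0] * (length + 1)
    for start, end in spans:
        diff[start] += 1                     # coverage rises where a clone begins
        diff[end] -= 1                       # and falls just past where it ends
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases with clone coverage below `threshold` (3x in the text)."""
    return sum(c < threshold for c in coverage) / len(coverage)


toy_spans = [(0, 6), (2, 8), (4, 10), (4, 6)]
toy_cov = clone_coverage(10, toy_spans)
toy_low = fraction_below(toy_cov)
```

Regions where this count drops toward zero are joined by few or no clones, which is why low clone coverage flags a potential misassembly.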

Table 5. Mate-pair validation. Celera fragment sequences were mapped to the published sequence of chromosome 21. Each mate pair uniquely mapped was evaluated for correct orientation and placement (number of mate pairs tested). If the two mates had incorrect relative orientation or placement, they were considered invalid (number of invalid mate pairs).

| Library type | Library no. | Chr 21: mean insert size (bp) | Chr 21: SD (bp) | Chr 21: SD/mean (%) | Chr 21: no. of mate pairs tested | Chr 21: no. of invalid mate pairs | Chr 21: % invalid | Genome: mean insert size (bp) | Genome: SD (bp) | Genome: SD/mean (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 kbp | 1 | 2,081 | 106 | 5.1 | 3,642 | 38 | 1.0 | 2,082 | 90 | 4.3 |
| 2 kbp | 2 | 1,913 | 152 | 7.9 | 28,029 | 413 | 1.5 | 1,923 | 118 | 6.1 |
| 2 kbp | 3 | 2,166 | 175 | 8.1 | 4,405 | 57 | 1.3 | 2,162 | 158 | 7.3 |
| 10 kbp | 4 | 11,385 | 851 | 7.5 | 4,319 | 80 | 1.9 | 11,370 | 696 | 6.1 |
| 10 kbp | 5 | 14,523 | 1,875 | 12.9 | 7,355 | 156 | 2.1 | 14,142 | 1,402 | 9.9 |
| 10 kbp | 6 | 9,635 | 1,035 | 10.7 | 5,573 | 109 | 2.0 | 9,606 | 934 | 9.7 |
| 10 kbp | 7 | 10,223 | 928 | 9.1 | 34,079 | 399 | 1.2 | 10,190 | 777 | 7.6 |
| 50 kbp | 8 | 64,888 | 2,747 | 4.2 | 16 | 1 | 6.3 | 65,500 | 5,504 | 8.4 |
| 50 kbp | 9 | 53,410 | 5,834 | 10.9 | 914 | 170 | 18.6 | 53,311 | 5,546 | 10.4 |
| 50 kbp | 10 | 52,034 | 7,312 | 14.1 | 5,871 | 569 | 9.7 | 51,498 | 6,588 | 12.8 |
| 50 kbp | 11 | 52,282 | 7,454 | 14.3 | 2,629 | 213 | 8.1 | 52,282 | 7,454 | 14.3 |
| 50 kbp | 12 | 46,616 | 7,378 | 15.8 | 2,153 | 215 | 10.0 | 45,418 | 9,068 | 20.0 |
| 50 kbp | 13 | 55,788 | 10,099 | 18.1 | 2,244 | 249 | 11.1 | 53,062 | 10,893 | 20.5 |
| 50 kbp | 14 | 39,894 | 5,019 | 12.6 | 199 | 7 | 3.5 | 36,838 | 9,988 | 27.1 |
| BES | 15 | 48,931 | 9,813 | 20.1 | 144 | 10 | 6.9 | 47,845 | 4,774 | 10.0 |
| BES | 16 | 48,130 | 4,232 | 8.8 | 195 | 14 | 7.2 | 47,924 | 4,581 | 9.6 |
| BES | 17 | 106,027 | 27,778 | 26.2 | 330 | 16 | 4.8 | 152,000 | 26,600 | 17.5 |
| BES | 18 | 160,575 | 54,973 | 34.2 | 155 | 8 | 5.2 | 161,750 | 27,000 | 16.7 |
| BES | 19 | 164,155 | 19,453 | 11.9 | 642 | 44 | 6.9 | 176,500 | 19,500 | 11.05 |
| Sum | | | | | 102,894 | 2,768 | 2.7 (mean = 2.7) | | | |

We examined the locations and number of all misoriented and misseparated mates. In addition to doing this analysis on the CSA assembly (as of 1 October 2000), we also performed a study of the PFP assembly as of 5 September 2000 (30, 55b). In this latter case, Celera mate pairs had to be mapped to the PFP assembly. To avoid mapping errors due to high-fidelity repeats, the only pairs mapped were those for which both reads matched at only one location with less than 6% differences. A threshold was set such that sets of five or more simultaneously invalid mate pairs indicated a potential breakpoint, where the construction of the two assemblies differed. The graphic comparison of the CSA chromosome 21 assembly with the published sequence (Fig. 6A) serves as a validation of this methodology. Blue tick marks in the panels indicate breakpoints. There were a similar (small) number of breakpoints on both chromosome sequences. The exception was 12 sets of scaffolds in the Celera assembly (a total of 3% of the chromosome length in 212 single-contig scaffolds) that were mapped to the wrong positions because they were too small to be mapped reliably. Figures 6 and 7 and Table 6 illustrate the mate-pair differences and breakpoints between the two assemblies. There was a higher percentage of misoriented and misseparated mate pairs in the large-insert libraries (50 kbp and BAC ends) than in the small-insert libraries in both assemblies (Table 6). The large-insert libraries are more likely to identify discrepancies simply because they span a larger segment of the genome. The graphic comparison between the two assemblies for chromosome 8 (Fig. 6, B and C) shows that there are many more breakpoints for the PFP assembly than for the Celera assembly. Figure 7 shows the breakpoint map (blue tick marks) for both assemblies of each chromosome in a side-by-side fashion. The order and orientation of Celera’s assembly shows substantially fewer breakpoints except on the two finished chromosomes. Figure 7 also depicts large gaps (>10 kbp) in both assemblies as red tick marks. In the CSA assembly, the sizes of all gaps have been estimated on the basis of the mate-pair data. Breakpoints can be caused by structural polymorphisms, because the two assemblies were derived from different human genomes. They also reflect the unfinished nature of both genome assemblies.
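One way to read the breakpoint threshold is as a sweep over the spans of invalid mate pairs that reports regions crossed by five or more of them at once. The text does not spell out how "simultaneously invalid" was evaluated, so the overlap interpretation, the half-open spans, and the function name below are assumptions for illustration.

```python
def breakpoint_regions(invalid_spans, min_pairs=5):
    """Regions crossed by at least `min_pairs` invalid mate-pair spans.

    invalid_spans: half-open (start, end) spans of invalid mate pairs.
    Returns a list of (start, end) regions where the depth of invalid
    spans reaches the threshold.
    """
    events = []
    for start, end in invalid_spans:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()            # at equal positions, span ends (-1) sort before starts (+1)

    regions, depth, region_start = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_pairs and region_start is None:
            region_start = pos
        elif depth < min_pairs and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions


# Toy data: five invalid pairs pile up near 300-600; one isolated pair elsewhere.
toy_invalid = [(100, 600), (150, 620), (200, 640), (250, 660), (300, 680), (5000, 5400)]
toy_breaks = breakpoint_regions(toy_invalid)
```

Requiring several independent pairs guards against the isolated invalid pairs expected from tracking errors and chimeric clones, which Table 5 puts at a few percent per library.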

3 Gene Prediction and Annotation

Summary. To enumerate the gene inventory, we developed an integrated, evidence-based approach named Otto. The evidence used to increase the likelihood of identifying genes includes regions conserved between the mouse and human genomes, similarity to ESTs or other mRNA-derived data, or similarity to other proteins. A comparison of Otto (combined Otto-RefSeq and Otto homology) with Genscan, a standard gene-prediction algorithm, showed greater sensitivity (0.78 versus 0.50) and specificity (0.93 versus 0.63) of Otto in the ability to define gene structure. Otto-predicted genes were complemented with a set of genes from three gene-prediction programs that exhibited weaker, but still significant, evidence that they may be expressed. Conservative criteria, requiring at least two lines of evidence, were used to define a set of 26,383 genes with good confidence that were used for more detailed analysis presented in the subsequent sections. Extensive manual curation to establish precise characterization of gene structure will be necessary to improve the results from this initial computational approach.

3.1 Automated gene annotation

A gene is a locus of cotranscribed exons. A single gene may give rise to multiple transcripts, and thus multiple distinct proteins with multiple functions, by means of alternative splicing and alternative transcription initiation and termination sites. Our cells are able to discern within the billions of base pairs of the genomic DNA the signals for initiating transcription and for splicing together exons separated by a few or hundreds of thousands of base pairs. The first step in characterizing the genome is to define the structure of each gene and each transcription unit.

The number of protein-coding genes in mammals has been controversial from the outset. Initial estimates based on reassociation data placed it between 30,000 and 40,000, whereas later estimates from the brain were >100,000 (56). More recent data from both the corporate and public sectors, based on EST, CpG island, and transcript density–based extrapolations, have not reduced this variance. The highest recent number of 142,634 genes emanates from a report from Incyte Pharmaceuticals, and is based on a combination of EST data and the association of ESTs with CpG islands (57). In stark contrast are three quite different, and much lower estimates: one of ~35,000 genes derived with genome-wide EST data and sampling procedures in conjunction with chromosome 22 data (58); another of 28,000 to 34,000 genes derived with a comparative methodology involving sequence conservation between humans and the puffer fish Tetraodon nigroviridis (59); and a figure of 35,000 genes, which was derived simply by extrapolating from the density of 770 known and predicted genes in the 67 Mbp of chromosomes 21 and 22, to the approximately 3-Gbp euchromatic genome.
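The last of these estimates is a simple density extrapolation, which can be reproduced from the numbers in the text. The variable names are illustrative, and the genome size is taken as exactly 3,000 Mbp for the purpose of the calculation.

```python
# Density extrapolation from chromosomes 21 and 22 to the whole genome.
known_genes = 770          # known and predicted genes on chromosomes 21 and 22
region_mbp = 67            # combined size of the two chromosomes, in Mbp
genome_mbp = 3_000         # approximate euchromatic genome size, in Mbp

genes_per_mbp = known_genes / region_mbp
genome_estimate = genes_per_mbp * genome_mbp
```

The result is a little under 34,500 genes, consistent with the rounded figure of 35,000; the estimate assumes that gene density on these two chromosomes is representative of the genome as a whole.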

The problem of computational identification of transcriptional units in genomic DNA sequence can be divided into two phases. The first is to partition the sequence into segments that are likely to correspond to individual genes. This is not trivial and is a weakness of most de novo gene-finding algorithms. It is also critical to determining the number of genes in the human gene inventory. The second challenge is to construct a gene model that reflects the probable structure of the transcript(s) encoded in the region. This can be done with reasonable accuracy when a full-length cDNA has been sequenced or a highly homologous protein sequence is known. De novo gene prediction, although less accurate, is the only way to find genes that are not represented by homologous proteins or ESTs. The following section describes the methods we have developed to address these problems for the prediction of protein-coding genes.

We have developed a rule-based expert system, called Otto, to identify and characterize genes in the human genome (60). Otto attempts to simulate in software the process that a human annotator uses to identify a gene and refine its structure. In the process of annotating a region of the genome, a human curator examines the evidence provided by the computational pipeline (described below) and examines how various types of evidence relate to one another. A curator puts different levels of confidence in different types of evidence and looks for certain patterns of evidence to support gene annotation. For example, a curator may examine homology to a number of ESTs and evaluate whether or not they can be connected into a longer, virtual mRNA. The curator would also evaluate the strength of the similarity and the contiguity of the match, in essence asking whether any ESTs cross splice-junctions and whether the edges of putative exons have consensus splice sites. This kind of manual annotation process was used to annotate the Drosophila genome.

The Otto system can promote observed evidence to a gene annotation in one of two ways. First, if the evidence includes a high-quality match to the sequence of a known gene [here defined as a human gene represented in a curated subset of the RefSeq database (61)], then Otto can promote this to a gene annotation. In the second method, Otto evaluates a broad spectrum of evidence and determines if this evidence is adequate to support promotion to a gene annotation. These processes are described below.

Initially, gene boundaries are predicted on the basis of examination of sets of overlapping protein and EST matches generated by a computational pipeline (62). This pipeline searches the scaffold sequences against protein, EST, and genome-sequence databases to define regions of sequence similarity and runs three de novo gene-prediction programs.

To identify likely gene boundaries, re-gions of the genome were partitioned by Ottoon the basis of sequence matches identifiedby BLAST. Each of the database sequencesmatched in the region under analysis wascompared by an algorithm that takes intoaccount both coordinates of the matching se-quence, as well as the sequence type (e.g.,protein, EST, and so forth). The results wereused to group the matches into bins of relatedsequences that may define a gene and identify

Table 6. Genome-wide mate pair analysis of compartmentalized shotgun (CSA) and PFP assemblies.*

                  CSA                                        PFP
Genome library    % valid  % misoriented  % misseparated†    % valid  % misoriented  % misseparated†
2 kbp             98.5     0.6            1.0                95.7     2.0            2.3
10 kbp            96.7     1.0            2.3                81.9     9.6            8.6
50 kbp            93.9     4.5            1.5                64.2     22.3           13.5
BES               94.1     2.1            3.8                62.0     19.3           18.8
Mean              97.4     1.0            1.6                87.3     6.8            5.9

*Data for individual chromosomes can be found in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1. †Mates are misseparated if their distance is >3 SD from the mean library size.


gene boundaries. During this process, multiple hits to the same region were collapsed to a coherent set of data by tracking the coverage of a region. For example, if a group of bases was represented by multiple overlapping ESTs, the union of these regions matched by the set of ESTs on the scaffold was marked as being supported by EST evidence. This resulted in a series of “gene bins,” each of which was believed to contain a single gene. One weakness of this initial implementation of the algorithm was in predicting gene boundaries in regions of tandemly duplicated genes. Gene clusters frequently resulted in homologous neighboring genes being joined together, resulting in an annotation that artificially concatenated these gene models.
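As an illustration only (this is not Otto's code), the coverage-tracking step for one evidence type can be sketched in Python as the union of match intervals. The function name `merge_intervals`, the example coordinates, and the half-open coordinate convention are all assumptions made for this sketch.

```python
def merge_intervals(intervals):
    """Collapse overlapping (start, end) matches into their union.

    Coordinates are assumed to be 0-based and half-open; this convention
    is an illustrative choice, not one stated in the text.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous block: extend that block.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical EST matches on a scaffold.
est_hits = [(100, 250), (200, 400), (380, 420), (900, 1000)]
est_supported = merge_intervals(est_hits)
print(est_supported)
```

Each merged block stands for a stretch of scaffold marked as supported by EST evidence; grouping blocks of several evidence types into gene bins would need further rules that the text does not spell out.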

Next, known genes (those with exact matches of a full-length cDNA sequence to the genome) were identified, and the region corresponding to the cDNA was annotated as a predicted transcript. A subset of the curated human gene set RefSeq from the National Center for Biotechnology Information (NCBI) was included as a data set searched in the computational pipeline. If a RefSeq transcript matched the genome assembly for at least 50% of its length at >92% identity, then the SIM4 (63) alignment of the RefSeq transcript to the region of the genome under analysis was promoted to the status of an Otto annotation. Because the genome sequence has gaps and sequence errors such as frameshifts, it was not always possible to predict a transcript that agrees precisely with the experimentally determined cDNA sequence. A total of 6538 genes in our inventory were identified and transcripts predicted in this way.
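The promotion rule for known genes can be written as a simple predicate. The following Python sketch assumes that the matched length and the identity of the alignment are already available as numbers; the function and parameter names are hypothetical and are not taken from Otto.

```python
def promote_refseq(aligned_len, transcript_len, identity,
                   min_coverage=0.50, min_identity=0.92):
    """Return True if a RefSeq match meets the stated thresholds.

    aligned_len    -- transcript bases matched to the assembly (assumed input)
    transcript_len -- full length of the RefSeq transcript
    identity       -- fractional identity of the match (0 to 1)
    """
    coverage = aligned_len / transcript_len
    # "At least 50% of its length" is inclusive; ">92% identity" is strict.
    return coverage >= min_coverage and identity > min_identity


print(promote_refseq(1200, 2000, 0.95))  # 60% of length, 95% identity
print(promote_refseq(800, 2000, 0.99))   # only 40% of length matched
```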

Regions that have a substantial amount of sequence similarity, but do not match known genes, were analyzed by that part of the Otto system that uses the sequence similarity information to predict a transcript. Here, Otto

Fig. 6. Comparison of the CSA and the PFP assembly. (A) All of chromosome 21, (B) all of chromosome 8, and (C) a 1-Mb region of chromosome 8 representing a single Celera scaffold. To generate the figure, Celera fragment sequences were mapped onto each assembly. The PFP assembly is indicated in the upper third of each panel; the Celera assembly is indicated in the lower third. In the center of the panel, green lines show Celera sequences that are in the same order and orientation in both assemblies and form the longest consistently ordered run of sequences. Yellow lines indicate sequence blocks that are in the same orientation, but out of order. Red lines indicate sequence blocks that are not in the same orientation. For clarity, in the latter two cases, lines are only drawn between segments of matching sequence that are at least 50 kbp long. The top and bottom thirds of each panel show the extent of Celera mate-pair violations (red, misoriented; yellow, incorrect distance between the mates) for each assembly grouped by library size. (Mate pairs that are within the correct distance, as expected from the mean library insert size, are omitted from the figure for clarity.) Predicted breakpoints, corresponding to stacks of violated mate pairs of the same type, are shown as blue ticks on each assembly axis. Runs of more than 10,000 Ns are shown as cyan bars. Plots of all 24 chromosomes can be seen in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.


evaluates evidence generated by the computational pipeline, corresponding to conservation between mouse and human genomic DNA, similarity to human transcripts (ESTs and cDNAs), similarity to rodent transcripts (ESTs and cDNAs), and similarity of the translation of human genomic DNA to known proteins to predict potential genes in the human genome. The sequence from the region of genomic DNA contained in a gene bin was extracted, and the subsequences supported by any homology evidence were marked (plus 100

Fig. 7. Schematic view of the distribution of breakpoints and large gaps on all chromosomes. For each chromosome, the upper pair of lines represent the PFP assembly, and the lower pair of lines represent Celera’s assembly. Blue tick marks represent breakpoints, whereas red tick marks represent a gap of larger than 10,000 bp. The number of breakpoints per chromosome is indicated in black, and the chromosome numbers in red.


bases flanking these regions). The other bases in the region, those not covered by any homology evidence, were replaced by N’s. This sequence segment, with high confidence regions represented by the consensus genomic sequence and the remainder represented by N’s, was then evaluated by Genscan to see if a consistent gene model could be generated. This procedure simplified the gene-prediction task by first establishing the boundary for the gene (not a strength of most gene-finding algorithms), and by eliminating regions with no supporting evidence. If Genscan returned a plausible gene model, it was further evaluated before being promoted to an “Otto” annotation. The final Genscan predictions were often quite different from the prediction that Genscan returned on the same region of native genomic sequence. A weakness of using Genscan to refine the gene model is the loss of valid, small exons from the final annotation.
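The masking step that precedes the Genscan run can be sketched as follows. This is a minimal Python illustration, assuming evidence is supplied as 0-based, half-open intervals on the gene-bin sequence; `mask_unsupported` is a hypothetical name, and the call to Genscan itself is not shown.

```python
def mask_unsupported(seq, evidence, flank=100):
    """Replace every base outside evidence (plus flanking bases) with 'N'.

    seq      -- genomic sequence of one gene bin
    evidence -- iterable of (start, end) intervals with homology support
    flank    -- bases kept on each side of an interval (100 in the text)
    """
    keep = [False] * len(seq)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(seq), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if k else "N" for base, k in zip(seq, keep))


# Toy example with a 2-base flank so the effect is visible.
print(mask_unsupported("ACGT" * 5, [(8, 12)], flank=2))
```

The masked string has the same length as the input, so coordinates of any gene model predicted on it map directly back to the scaffold.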

The next step in defining gene structures based on sequence similarity was to compare each predicted transcript with the homology-based evidence that was used in previous steps to evaluate the depth of evidence for each exon in the prediction. Internal exons were considered to be supported if they were covered by homology evidence to within ±10 bases of their edges. For first and last exons, the internal edge was required to be within 10 bases, but the external edge was allowed greater latitude to allow for 5′ and 3′ untranslated regions (UTRs). To be retained, a prediction for a multi-exon gene must have evidence such that the total number of “hits,” as defined above, divided by the number of exons in the prediction must be >0.66 or must correspond to a RefSeq sequence. A single-exon gene must be covered by at least three supporting hits (±10 bases on each side), and these must cover the complete predicted open reading frame. For a single-exon gene, we also required that the Genscan prediction include both a start and a stop codon. Gene models that did not meet these criteria were disregarded, and those that passed were promoted to Otto predictions. Homology-based Otto predictions do not contain 3′ and 5′ untranslated sequence. Although three de novo gene-finding programs [GRAIL, Genscan, and FgenesH (63)] were run as part of the computational analysis, the results of these programs were not directly used in making the Otto predictions. Otto predicted 11,226 additional genes by means of sequence similarity.
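The retention rules above can be summarized in a short Python sketch. It is one reading of the text under stated assumptions: exons and hits are (start, end) pairs, a hit supports an internal exon when both of its edges lie within 10 bases of the exon edges, and the relaxed external edge of first and last exons is not modeled. All names are hypothetical.

```python
def exon_supported(exon, hit, tol=10):
    """True if both edges of a hit lie within `tol` bases of the exon edges."""
    return abs(hit[0] - exon[0]) <= tol and abs(hit[1] - exon[1]) <= tol


def keep_prediction(n_exons, n_hits, is_refseq=False,
                    orf_fully_covered=False, has_start_and_stop=False):
    """Apply the multi-exon and single-exon retention rules."""
    if n_exons > 1:
        # Multi-exon: hits per exon must exceed 0.66, or the model
        # must correspond to a RefSeq sequence.
        return is_refseq or (n_hits / n_exons) > 0.66
    # Single-exon: at least three hits covering the whole ORF, and the
    # Genscan model must contain both a start and a stop codon.
    return n_hits >= 3 and orf_fully_covered and has_start_and_stop


print(keep_prediction(n_exons=6, n_hits=4))  # 4/6 is just above 0.66
print(keep_prediction(n_exons=3, n_hits=1))  # 1/3 is below the cutoff
```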

3.2 Otto validation

To validate the Otto homology-based process and the method that Otto uses to define the structures of known genes, we compared transcripts predicted by Otto with their corresponding (and presumably correct) transcript from a set of 4512 RefSeq transcripts for which there was a unique SIM4 alignment (Table 7). In order to evaluate the relative performance of Otto and Genscan, we made three comparisons. The first involved a determination of the accuracy of gene models predicted by Otto with only homology data other than the corresponding RefSeq sequence (Otto-homology in Table 7). We measured the sensitivity (correctly predicted bases divided by the total length of the cDNA) and specificity (correctly predicted bases divided by the sum of the correctly and incorrectly predicted bases). Second, we examined the sensitivity and specificity of the Otto predictions that were made solely with the RefSeq sequence, which is the process that Otto uses to annotate known genes (Otto-RefSeq). And third, we determined the accuracy of the Genscan predictions corresponding to these RefSeq sequences. As expected, the alignment method (Otto-RefSeq) was the most accurate, and Otto-homology performed better than Genscan by both criteria. Thus, 6.1% of true RefSeq nucleotides were not represented in the Otto-RefSeq annotations and 2.7% of the nucleotides in the Otto-RefSeq transcripts were not contained in the original RefSeq transcripts. The discrepancies could come from legitimate differences between the Celera assembly and the RefSeq transcript due to polymorphisms, incomplete or incorrect data in the Celera assembly, errors introduced by Sim4 during the alignment process, or the presence of alternatively spliced forms in the data set used for the comparisons.
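The two accuracy measures reduce to simple ratios. The Python sketch below follows the definitions given above; the function name and the example base counts are invented for illustration and were chosen only to reproduce ratios close to the Otto-RefSeq values.

```python
def sensitivity_specificity(n_correct, refseq_len, prediction_len):
    """Nucleotide-level accuracy of one predicted transcript.

    n_correct      -- predicted bases that align to the RefSeq transcript
    refseq_len     -- length of the published RefSeq transcript
    prediction_len -- length of the prediction (correct + incorrect bases)
    """
    sensitivity = n_correct / refseq_len
    specificity = n_correct / prediction_len
    return sensitivity, specificity


# Hypothetical counts: 939 of 1000 RefSeq bases recovered by a 965-base model.
sens, spec = sensitivity_specificity(939, 1000, 965)
print(round(sens, 3), round(spec, 3))
```

With these made-up counts, 6.1% of the RefSeq bases are missed and about 2.7% of the predicted bases fall outside the RefSeq transcript, which is how the percentages quoted in the text relate to the two ratios.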

Because Otto uses an evidence-based approach to reconstruct genes, the absence of experimental evidence for intervening exons may inadvertently result in a set of exons that cannot be spliced together to give rise to a transcript. In such cases, Otto may “split genes” when in fact all the evidence should be combined into a single transcript. We also examined the tendency of these methods to incorrectly split gene predictions. These trends are shown in Fig. 8. Both RefSeq and homology-based predictions by Otto split known genes into fewer segments than Genscan alone.

3.3 Gene number

Recognizing that the Otto system is quite conservative, we used a different gene-prediction strategy in regions where the homology evidence was less strong. Here the results of de novo gene predictions were used. For these genes, we insisted that a predicted transcript have at least two of the following types of evidence to be included in the gene set for further analysis: protein, human EST, rodent EST, or mouse genome fragment matches. This final class of predicted genes is a subset of the predictions made by the three gene-finding programs that were used in the computational pipeline. For these, there was not sufficient sequence similarity information for Otto to attempt to predict a gene structure. The three de novo gene-finding programs resulted in about 155,695 predictions, of which ~76,410 were nonredundant (nonoverlapping with one another). Of these, 57,935 did not overlap known genes or predictions made by Otto. Only 21,350 of the gene predictions that did not overlap Otto predictions were partially supported by at least one type of sequence similarity evidence, and 8619 were partially supported by two types of evidence (Table 8).

The sum of this number (21,350) and the number of Otto annotations (17,764), 39,114, is near the upper limit for the human gene complement. As seen in Table 8, if the requirement for other supporting evidence is made more stringent, this number drops rapidly so that demanding two types of evidence reduces the total gene number to 26,383 and demanding three types reduces it to ~23,000. Requiring that a prediction be supported by all four categories of evidence is too stringent because it would eliminate genes that encode novel proteins (members of currently undescribed protein families). No correction for pseudogenes has been made at this point in the analysis.
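The totals quoted in this paragraph follow from adding the Otto gene count to the de novo counts in Table 8. A small Python check of that arithmetic is given below; the variable names are ours, and the numbers are copied from the text and from Table 8.

```python
# Otto annotations: known genes plus homology-based predictions.
OTTO_GENES = 6_538 + 11_226          # 17,764

# De novo predictions (not overlapping Otto) by minimum number of
# supporting evidence types, from the de novo transcript row of Table 8.
DE_NOVO_BY_MIN_EVIDENCE = {1: 21_350, 2: 8_619, 3: 4_947, 4: 1_904}

totals = {k: OTTO_GENES + n for k, n in DE_NOVO_BY_MIN_EVIDENCE.items()}
for k in sorted(totals):
    print(f">= {k} evidence type(s): {totals[k]:,} genes")
```

The three-evidence total works out to 22,711, which the text rounds to ~23,000.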

In a further attempt to identify genes that were not found by the autoannotation process or any of the de novo gene finders, we examined regions outside of gene predictions that were similar to the EST sequence, and where the EST matched the genomic sequence across a splice junction. After correcting for potential 3′ UTRs of predicted genes, about 2500 such regions remained. Adding a requirement for at least one of the following evidence types (homology to mouse genomic sequence fragments, rodent ESTs, or cDNAs) or similarity to a known protein reduced this number to 1010. Adding this to the numbers from the previous paragraph would give us estimates of about 40,000, 27,000, and 24,000 potential genes in the human genome, depending on the stringency of evidence considered. Table 8 illustrates the number of genes and presents the degree of

Table 7. Sensitivity and specificity of Otto and Genscan. Sensitivity and specificity were calculated by first aligning the prediction to the published RefSeq transcript, tallying the number (N) of uniquely aligned RefSeq bases. Sensitivity is the ratio of N to the length of the published RefSeq transcript. Specificity is the ratio of N to the length of the prediction. All differences are significant (Tukey HSD; P < 0.001).

Method                 Sensitivity   Specificity
Otto (RefSeq only)*    0.939         0.973
Otto (homology)†       0.604         0.884
Genscan                0.501         0.633

*Refers to those annotations produced by Otto using only the Sim4-polished RefSeq alignment rather than an evidence-based Genscan prediction. †Refers to those annotations produced by supplying all available evidence to Genscan.


confidence based on the supporting evidence. Transcripts encoded by a set of 26,383 genes were assembled for further analysis. This set includes the 6538 genes predicted by Otto on the basis of matches to known genes, 11,226 transcripts predicted by Otto based on homology evidence, and 8619 from the subset of transcripts from de novo gene-prediction programs that have two types of supporting evidence. The 26,383 genes are illustrated along chromosome diagrams in Fig. 1. These are a very preliminary set of annotations and are subject to all the limitations of an automated process. Considerable refinement is still necessary to improve the accuracy of these transcript predictions. All the predictions and descriptions of genes and the associated evidence that we present are the product of completely computational processes, not expert curation. We have attempted to enumerate the genes in the human genome in such a way that we have different levels of confidence based on the amount of supporting evidence: known genes, genes with good protein or EST homology evidence, and de novo gene predictions confirmed by modest homology evidence.

3.4 Features of human gene transcripts

We estimate the average span for a “typical” gene in the human DNA sequence to be about 27,894 bases. This is based on the average span covered by RefSeq transcripts, used because it represents our highest confidence set.

The set of transcripts promoted to gene annotations varies in a number of ways. As can be seen from Table 8 and Fig. 9, transcripts predicted by Otto tend to be longer, having on average about 7.8 exons, whereas those promoted from gene-prediction programs average about 3.7 exons. The largest number of exons that we have identified in a transcript is 234 in the titin mRNA. Table 8 compares the amounts of evidence that support the Otto and other predicted transcripts. For example, one can see that a typical Otto transcript has 6.99 of its 7.81 exons supported by protein homology evidence. As would be expected, the Otto transcripts generally have more support than do transcripts predicted by the de novo methods.

4 Genome Structure

Summary. This section describes several of the noncoding attributes of the assembled genome sequence and their correlations with the predicted gene set. These include an analysis of G+C content and gene density in the context of cytogenetic maps of the genome, an enumerative analysis of CpG islands, and a brief description of the genome-wide repetitive elements.

4.1 Cytogenetic maps

Perhaps the most obvious, and certainly the most visible, element of the structure of the genome is the banding pattern produced by Giemsa stain. Chromosomal banding studies have revealed that about 17% to 20% of the human chromosome complement consists of C-bands, or constitutive heterochromatin (64). Much of this heterochromatin is highly polymorphic and consists of different families of alpha satellite DNAs with various higher order repeat structures (65). Many chromosomes have complex inter- and intrachromosomal duplications present in pericentromeric regions (66). About 5% of the sequence reads were identified as alpha satellite sequences; these were not included in the assembly.

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512 Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by supplying all available evidence to Genscan) were tallied. These data show the degree to which multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq transcript. The zero class for the Otto-homology predictions shown here indicates that the Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call was made because of insufficient evidence.

Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

                                            Types of evidence                       No. of lines of evidence*
                                 Total      Mouse     Rodent   Protein   Human      ≥1        ≥2        ≥3       ≥4
Otto      Number of transcripts  17,969     17,065    14,881   15,477    16,374     17,968†   17,501    15,877   12,451
          Number of exons        141,218    111,174   89,569   108,431   118,869    140,710   127,955   99,574   59,804
De novo   Number of transcripts  58,032     14,463    5,094    8,043     9,220      21,350    8,619     4,947    1,904
          Number of exons        319,935    48,594    19,344   26,264    40,104     79,148    31,130    17,508   6,520
No. of exons per transcript
          Otto                   7.84       5.77      6.01     6.99      7.24       7.81      7.19      6.00     4.28
          De novo                5.53       3.17      3.80     3.27      4.36       3.7       3.56      3.42     3.16

*Four kinds of evidence (conservation in 3× mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript. †This number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.


Examination of pericentromeric regions is ongoing.

The remaining ~80% of the genome, the euchromatic component, is divisible into G-, R-, and T-bands (67). These cytogenetic bands have been presumed to differ in their nucleotide composition and gene density, although we have been unable to determine precise band boundaries at the molecular level. T-bands are the most G+C- and gene-rich, and G-bands are G+C-poor (68). Bernardi has also offered a description of the euchromatin at the molecular level as long stretches of DNA of differing base composition, termed isochores (denoted L, H1, H2, and H3), which are >300 kbp in length (69). Bernardi defined the L (light) isochores as G+C-poor (<43%), whereas the H (heavy) isochores fall into three G+C-rich classes representing 24, 8, and 5% of the genome. Gene concentration has been claimed to be very low in the L isochores and 20-fold more enriched in the H2 and H3 isochores (70). By examining contiguous 50-kbp windows of G+C content across the assembly, we found that regions of G+C content >48% (H3 isochores) averaged 273.9 kbp in length, those with G+C content between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp in length, and the average span of regions with <43% (L isochores) was 1078.6 kbp. The correlation between G+C content and gene density was also examined in 50-kbp windows along the assembled sequence (Table 9 and Figs. 10 and 11). We found that the density of genes was greater in regions of high G+C than in regions of low G+C content, as expected. However, the correlation between G+C content and gene density was not as skewed as previously predicted (69). A higher proportion of genes were located in the G+C-poor regions than had been expected.
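A minimal Python sketch of the window-based classification is shown below. It assumes non-overlapping windows, ignores N's when computing G+C content, drops a trailing partial window, and assigns the boundary values of exactly 43% and 48% to the H1+H2 class; none of these details are specified in the text, and the function names are hypothetical.

```python
def gc_percent(seq):
    """Percent G+C among unambiguous bases, or None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def isochore_class(gc):
    """Map a G+C percentage to the three classes used in Table 9."""
    if gc > 48:
        return "H3"
    if gc >= 43:
        return "H1+H2"
    return "L"


def classify_windows(seq, window=50_000):
    """Classify contiguous, non-overlapping windows along a sequence."""
    calls = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_percent(seq[start:start + window])
        calls.append((start, None if gc is None else isochore_class(gc)))
    return calls


# Toy sequence with a 4-base window so the result is easy to follow.
print(classify_windows("GGCC" + "ATAT", window=4))
```

Merging adjacent windows of the same class would then give region lengths comparable to the averages reported above.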

Chromosomes 17, 19, and 22, which have a disproportionate number of H3-containing bands, had the highest gene density (Table 10). Conversely, the chromosomes that we found to have the lowest gene density (X, 4, 18, 13, and Y) also have the fewest H3 bands. Chromosome 15, which also has few H3 bands, did not have a particularly low gene density in our analysis. In addition, chromosome 8, which we found to have a low gene density, does not appear to be unusual in its H3 banding.

How valid is Ohno’s postulate (71) that mammalian genomes consist of oases of genes in otherwise essentially empty deserts? It appears that the human genome does indeed contain deserts, or large, gene-poor regions. If we define a desert as a region >500 kbp without a gene, then we see that 605 Mbp, or about 20% of the genome, is in deserts. These are not uniformly distributed over the various chromosomes. Gene-rich chromosomes 17, 19, and 22 have only about 12% of their collective 171 Mbp in deserts, whereas gene-poor chromosomes 4, 13, 18, and X have 27.5% of their 492 Mbp in deserts (Table 11). The apparent lack of predicted genes in these regions does not necessarily imply that they are devoid of biological function.
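Given gene coordinates on one chromosome, deserts as defined above can be located by scanning the gaps between annotated genes. The Python sketch below is an illustration under assumptions: genes are (start, end) pairs in base pairs, and gaps at the chromosome ends are counted as deserts, which the text does not state. The function name and coordinates are invented.

```python
def find_deserts(genes, chrom_len, min_len=500_000):
    """Return gene-free intervals longer than `min_len` bases."""
    deserts = []
    prev_end = 0
    for start, end in sorted(genes):
        if start - prev_end > min_len:
            deserts.append((prev_end, start))
        # Track the furthest gene end seen, so nested genes are handled.
        prev_end = max(prev_end, end)
    if chrom_len - prev_end > min_len:
        deserts.append((prev_end, chrom_len))
    return deserts


# Hypothetical 2-Mbp chromosome with three genes.
genes = [(100_000, 150_000), (800_000, 900_000), (950_000, 1_000_000)]
deserts = find_deserts(genes, chrom_len=2_000_000)
desert_bp = sum(end - start for start, end in deserts)
print(deserts, desert_bp / 2_000_000)
```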

4.2 Linkage map

Linkage maps provide the basis for genetic analysis and are widely used in the study of the inheritance of traits and in the positional cloning of genes. The distance metric, centimorgans (cM), is based on the recombination rate between homologous chromosomes during meiosis. In general, the rate of recombination in females is greater than that in males, and this degree of map expansion is not uniform across the genome (72). One of the opportunities enabled by a nearly complete genome sequence is to produce the ultimate physical map, and to fully analyze its correspondence with two other maps that have been widely used in genome and genetic analysis: the linkage map and the cytogenetic map. This would close the loop between the mapping and sequencing phases of the genome project.

We mapped the location of the markers that constitute the Genethon linkage map to the genome. The rate of recombination, expressed as cM per Mbp, was calculated for 3-Mbp windows as shown in Table 12. Higher rates of recombination in the telomeric region of the chromosomes have been previously documented (73). From this mapping result, there is a difference of 4.99 between lowest rates and highest rates and the largest difference of 4.4 between males and females (4.99 to 0.47 on chromosome 16). This indicates that the variability in recombination rates among regions of the genome exceeds the differences in recombination rates between males and females. The human genome has recombination hotspots, where recombination rates vary fivefold or more over a space of 1 kbp, so the picture one gets of the magnitude of variability in recombination rate will depend on the size of the window

Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of the transcripts have one or two exons, and 5.7% have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20.

Table 9. Characteristics of G+C in isochores.

                        Fraction of genome        Fraction of genes
Isochore   G+C (%)      Predicted*   Observed     Predicted*   Observed
H3         >48          5            9.5          37           24.8
H1/H2      43–48        25           21.2         32           26.6
L          <43          67           69.2         31           48.5

*The predictions were based on Bernardi’s definitions (70) of the isochore structure of the human genome.


examined. Unfortunately, too few meiotic crossovers have occurred in Centre d’Etude du Polymorphism Humain (CEPH) and other reference families to provide a resolution any finer than about 3 Mbp. The next challenge will be to determine a sequence basis of recombination at the chromosomal level. An accurate predictor of the variation in recombination rates between any pair of markers would be extremely useful in designing markers to narrow a region of linkage, such as in positional cloning projects.
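The windowed rate calculation (cM per Mbp) can be illustrated with a short Python sketch. The text does not describe how the rate was computed within a window, so the approach here is an assumption: within each 3-Mbp window, the genetic distance between the first and last mapped marker is divided by their physical separation. The function name and marker values are invented.

```python
def recombination_rates(markers, window_bp=3_000_000):
    """Estimate cM/Mbp per window from (position_bp, genetic_cM) markers.

    Windows with fewer than two distinct marker positions are skipped.
    """
    by_window = {}
    for pos, cm in sorted(markers):
        by_window.setdefault(pos // window_bp, []).append((pos, cm))
    rates = {}
    for index, points in by_window.items():
        (first_pos, first_cm), (last_pos, last_cm) = points[0], points[-1]
        if last_pos > first_pos:
            mbp = (last_pos - first_pos) / 1e6
            rates[index * window_bp] = (last_cm - first_cm) / mbp
    return rates


# Hypothetical markers: physical position (bp) and map position (cM).
markers = [(500_000, 0.0), (2_500_000, 3.0),
           (3_200_000, 3.5), (5_200_000, 4.5)]
print(recombination_rates(markers))
```

Because the estimate depends on how many markers fall in a window, sparse windows give unreliable rates, which is consistent with the resolution limit of about 3 Mbp noted above.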

4.3 Correlation between CpG islands and genes

CpG islands are stretches of unmethylated DNA with a higher frequency of CpG dinucleotides when compared with the entire genome (74). CpG islands are believed to preferentially occur at the transcriptional start of genes, and it has been observed that most housekeeping genes have CpG islands at the 5′ end of the transcript (75, 76). In addition, experimental evidence indicates that CpG island methylation is correlated with gene inactivation (77) and has been shown to be important during gene imprinting (78) and tissue-specific gene expression (79).

Experimental methods have been used that resulted in an estimate of 30,000 to 45,000 CpG islands in the human genome (74, 80) and an estimate of 499 CpG islands on human chromosome 22 (81). Larsen et al. (76) and Gardiner-Garden and Frommer (75) used a computational method to identify CpG islands and defined them as regions of DNA of >200 bp that have a G+C content of >50% and a ratio of observed versus expected frequency of CG dinucleotide ≥0.6.

It is difficult to make a direct comparison of experimental definitions of CpG islands with computational definitions because computational methods do not consider the methylation state of cytosine and experimental methods do not directly select regions of high G+C content. However, we can determine the correlation of CpG island with gene starts, given a set of annotated genomic transcripts and the whole genome sequence. We have analyzed the publicly available annotation of chromosome 22, as well as using the entire human genome in our assembly and the computationally annotated genes. A variation of the CpG island computation was compared with Larsen et al. (76). The main differences are that we use a sliding window of 200 bp, consecutive windows are merged only if they overlap, and we recompute the CpG value upon merging, thus rejecting any potential island if it scores less than the threshold.
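The sliding-window procedure can be sketched in Python as follows. This is one plausible reading, not the implementation used for the analysis: the observed/expected ratio is computed with the commonly used formula (number of CpG × length) / (number of C × number of G), windows advance one base at a time, and a merged region is discarded if it fails the thresholds when rescored. Function names and the step size are assumptions.

```python
def gc_and_cpg_ratio(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    gc = (c + g) / n if n else 0.0
    ratio = seq.count("CG") * n / (c * g) if c and g else 0.0
    return gc, ratio


def passes(seq, min_gc=0.50, min_ratio=0.6):
    gc, ratio = gc_and_cpg_ratio(seq)
    return gc > min_gc and ratio >= min_ratio


def find_islands(seq, window=200, min_ratio=0.6):
    """Merge overlapping passing windows, then rescore each merged region."""
    merged = []
    for start in range(len(seq) - window + 1):
        if passes(seq[start:start + window], min_ratio=min_ratio):
            if merged and start < merged[-1][1]:
                # Overlaps the previous candidate: extend it.
                merged[-1] = (merged[-1][0], start + window)
            else:
                merged.append((start, start + window))
    # Recompute the CpG value on the merged region and reject failures.
    return [(a, b) for a, b in merged if passes(seq[a:b], min_ratio=min_ratio)]


# Toy sequence: an AT-rich flank, a CpG-dense block, another AT-rich flank.
toy = "AT" * 200 + "CG" * 200 + "AT" * 200
print(find_islands(toy))
print(find_islands(toy, min_ratio=0.8))  # stricter threshold, as in method 2
```

In this sketch the window size sets the minimum island length, and passing `min_ratio=0.8` corresponds to the stricter of the two thresholds discussed in the text.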

To compute various CpG statistics, we used two different thresholds of CG dinucleotide likelihood ratio. Besides using the original threshold of 0.6 (method 1), we used a higher threshold of CG dinucleotide likelihood ratio of 0.8 (method 2), which results in the number of CpG islands on chromosome 22 close to the number of annotated genes on this chromosome. The main results are summarized in Table 13. CpG islands computed with method 1 predicted only 2.6% of the CSA sequence as CpG, but 40% of the gene starts (start codons) are contained inside a CpG island. This is comparable to ratios reported by others (82). The last two rows of the table show the observed and expected average distance, respectively, of the closest CpG island from the first exon. The observed average distances to the closest CpG island are smaller than the corresponding expected distances, confirming an association between CpG island and the first exon.

We also looked at the distribution of CpG island nucleotides among various sequence classes such as intergenic regions, introns, exons, and first exons. We computed the likelihood score for each sequence class as the ratio of the observed fraction of CpG island nucleotides in that sequence class and the expected fraction of CpG island nucleotides in that sequence class. The result of applying method 1 on CSA were scores of 0.89 for intergenic region, 1.2 for intron, 5.86 for exon, and 13.2 for first exon. The same trend was also found for chromosome 22 and after the application of a higher threshold (method 2) on both data sets. In sum, genome-wide analysis has extended earlier analysis and suggests a strong correlation between CpG islands and first coding exons.
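The likelihood score can be illustrated with a short Python sketch. The text does not define the expected fraction explicitly; the sketch assumes that it equals the share of the genome occupied by the sequence class, so that a score of 1 means CpG island nucleotides are neither enriched nor depleted in that class. The function name and numbers are invented.

```python
def cpg_enrichment(island_nt_in_class, class_len, island_nt_total, genome_len):
    """Observed/expected share of CpG island nucleotides in one class.

    observed -- fraction of all CpG island nucleotides that fall in the class
    expected -- fraction of the genome that the class occupies (assumption)
    """
    observed = island_nt_in_class / island_nt_total
    expected = class_len / genome_len
    return observed / expected


# Hypothetical counts: a class covering 10% of the sequence that holds
# 20% of the CpG island nucleotides is twofold enriched.
print(cpg_enrichment(island_nt_in_class=20, class_len=100,
                     island_nt_total=100, genome_len=1000))
```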

4.4 Genome-wide repetitive elements

The proportion of the genome covered by various classes of repetitive DNA is presented in Table 14. We observed about 35% of the genome in these repeat classes, very similar to values reported previously (83). Repetitive sequence may be underrepresented in the Celera assembly as a result of incomplete repeat resolution, as discussed above. About 8% of the scaffold length is in gaps, and we expect that much of this is repetitive sequence. Chromosome 19 has the highest repeat density (57%), as well as the highest gene density (Table 10). Of interest, among the different classes of repeat elements, we observe a clear association of Alu elements and gene density, which was not observed between LINEs and gene density.

5 Genome Evolution

Summary. The dynamic nature of genome evolution can be captured at several levels. These include gene duplications mediated by RNA intermediates (retrotransposition) and segmental genomic duplications. In this section, we document the genome-wide occurrence of retrotransposition events generating functional (intronless paralogs) or inactive genes (pseudogenes). Genes involved in translational processes and nuclear regulation account for nearly 50% of all intronless paralogs and processed pseudogenes detected in our survey. We have also cataloged the extent of segmental genomic duplication and provide evidence for 1077 duplicated blocks covering 3522 distinct genes.

Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of genes associated with each G+C bin is represented by the yellow bars. The graph shows that about 5% of the genome has a G+C content of between 50 and 55%, but that this portion contains nearly 15% of the genes.


Fig. 11. Genome structural features.


5.1 Retrotransposition in the human genome

Retrotransposition of processed mRNA transcripts into the genome results in functional genes, called intronless paralogs, or inactivated genes (pseudogenes). A paralog refers to a gene that appears in more than one copy in a given organism as a result of a duplication event. The existence of both intron-containing and intronless forms of genes encoding functionally similar or identical proteins has been previously described (84, 85). Cataloging these evolutionary events on the genomic landscape is of value in understanding the functional consequences of such gene-duplication events in cellular biology. Identification of conserved intronless paralogs in the mouse or other mammalian genomes should provide the basis for capturing the evolutionary chronology of these transposition events and provide insights into gene loss and accretion in the mammalian radiation.

A set of proteins corresponding to all 901

Fig. 11 (continued). Relation among gene density (orange), G+C content (green), EST density (blue), and Alu density (pink) along the lengths of each of the chromosomes. Gene density was calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp windows. The number of ESTs and Alu elements is shown per 100-kbp window.


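Table 10. Features of the chromosomes. De novo/any refers to the union of de novo predictions that do not overlap Otto predictions and have at least one other type of supporting evidence; de novo/2x refers to the union of de novo predictions that do not overlap Otto predictions and have at least two types of evidence. Deserts are regions of sequence with no annotated genes. Columns are grouped as sequence coverage (CS assembly), base composition, gene prediction, and gene density (genes/Mbp).

| Chr. | Size (Mbp) | No. of scaffolds | Largest scaffold (Mbp) | No. of scaffolds >500 kbp | Sequence covered by scaffolds >500 kbp (Mbp) | % of total sequence in scaffolds >500 kbp | % repeat | % GC | No. of CpG islands | Otto | De novo/any | De novo/2x | Total (Otto + de novo/any) | Total (Otto + de novo/2x) | Sequence in deserts >500 kbp (Mbp) | Sequence in deserts >1 Mbp (Mbp) | Genes/Mbp: Otto | Genes/Mbp: de novo/any | Genes/Mbp: de novo/2x | Genes/Mbp: Otto + de novo/any | Genes/Mbp: Otto + de novo/2x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 220 | 2,549 | 11 | 82 | 192 | 88 | 37 | 42 | 2,335 | 1,743 | 1,710 | 710 | 3,453 | 2,453 | 29 | 6 | 8 | 8 | 3 | 16 | 11 |
| 2 | 240 | 3,263 | 13 | 78 | 217 | 91 | 36 | 40 | 1,703 | 1,183 | 1,771 | 633 | 2,954 | 1,816 | 55 | 19 | 5 | 7 | 2 | 12 | 7 |
| 3 | 200 | 3,532 | 7 | 78 | 173 | 87 | 37 | 40 | 1,271 | 1,013 | 1,414 | 598 | 2,427 | 1,611 | 50 | 12 | 5 | 7 | 3 | 12 | 8 |
| 4 | 186 | 2,180 | 10 | 70 | 169 | 91 | 37 | 38 | 1,081 | 696 | 1,165 | 449 | 1,861 | 1,145 | 55 | 18 | 4 | 6 | 2 | 10 | 6 |
| 5 | 182 | 3,231 | 11 | 63 | 163 | 89 | 37 | 40 | 1,302 | 892 | 1,244 | 474 | 2,136 | 1,366 | 46 | 15 | 5 | 7 | 2 | 11 | 7 |
| 6 | 172 | 1,713 | 13 | 58 | 160 | 93 | 37 | 40 | 1,384 | 943 | 1,314 | 524 | 2,257 | 1,467 | 38 | 9 | 6 | 7 | 3 | 13 | 8 |
| 7 | 146 | 1,326 | 14 | 53 | 130 | 89 | 38 | 40 | 1,406 | 759 | 1,072 | 460 | 1,831 | 1,219 | 26 | 12 | 5 | 7 | 3 | 12 | 8 |
| 8 | 146 | 1,772 | 11 | 54 | 135 | 92 | 36 | 40 | 948 | 583 | 977 | 357 | 1,560 | 940 | 33 | 6 | 4 | 7 | 2 | 11 | 6 |
| 9 | 113 | 1,616 | 8 | 40 | 101 | 89 | 38 | 41 | 1,315 | 689 | 848 | 329 | 1,537 | 1,018 | 22 | 9 | 6 | 7 | 3 | 13 | 8 |
| 10 | 130 | 2,005 | 9 | 55 | 116 | 89 | 36 | 42 | 1,087 | 685 | 968 | 342 | 1,653 | 1,027 | 21 | 8 | 5 | 7 | 2 | 12 | 7 |
| 11 | 132 | 2,814 | 9 | 44 | 116 | 88 | 39 | 42 | 1,461 | 1,051 | 1,134 | 535 | 2,185 | 1,586 | 27 | 9 | 8 | 8 | 4 | 16 | 12 |
| 12 | 134 | 2,614 | 8 | 51 | 117 | 87 | 38 | 41 | 1,131 | 925 | 936 | 417 | 1,861 | 1,342 | 24 | 9 | 7 | 7 | 3 | 14 | 10 |
| 13 | 99 | 1,038 | 13 | 34 | 91 | 91 | 36 | 38 | 644 | 341 | 691 | 241 | 1,032 | 582 | 31 | 16 | 4 | 7 | 2 | 10 | 5 |
| 14 | 87 | 576 | 11 | 16 | 83 | 95 | 40 | 41 | 913 | 583 | 700 | 290 | 1,283 | 873 | 34 | 20 | 7 | 8 | 3 | 14 | 10 |
| 15 | 80 | 1,747 | 8 | 31 | 70 | 87 | 37 | 42 | 722 | 558 | 640 | 246 | 1,198 | 804 | 8 | 1 | 7 | 8 | 3 | 15 | 10 |
| 16 | 75 | 1,520 | 8 | 27 | 62 | 82 | 40 | 44 | 1,533 | 748 | 673 | 247 | 1,421 | 995 | 13 | 3 | 10 | 9 | 3 | 19 | 12 |
| 17 | 78 | 1,683 | 6 | 40 | 61 | 78 | 39 | 45 | 1,489 | 897 | 648 | 313 | 1,545 | 1,210 | 15 | 6 | 12 | 8 | 4 | 19 | 15 |
| 18 | 79 | 1,333 | 13 | 18 | 72 | 92 | 36 | 40 | 510 | 283 | 543 | 189 | 826 | 472 | 21 | 10 | 4 | 7 | 2 | 10 | 6 |
| 19 | 58 | 2,282 | 3 | 31 | 38 | 67 | 57 | 49 | 2,804 | 1,141 | 534 | 268 | 1,675 | 1,409 | 3 | 0 | 20 | 9 | 4 | 29 | 23 |
| 20 | 61 | 580 | 14 | 17 | 58 | 94 | 41 | 44 | 997 | 517 | 469 | 180 | 986 | 697 | 7 | 1 | 8 | 7 | 3 | 16 | 11 |
| 21 | 33 | 358 | 10 | 6 | 32 | 96 | 38 | 41 | 519 | 184 | 265 | 102 | 449 | 286 | 15 | 9 | 6 | 8 | 3 | 13 | 8 |
| 22 | 36 | 333 | 11 | 12 | 32 | 88 | 44 | 48 | 1,173 | 494 | 341 | 147 | 835 | 641 | 3 | 0 | 14 | 9 | 4 | 23 | 17 |
| X | 128 | 1,346 | 4 | 91 | 93 | 73 | 46 | 39 | 726 | 605 | 860 | 387 | 1,465 | 992 | 29 | 8 | 5 | 6 | 3 | 11 | 7 |
| Y | 19 | 638 | 2 | 10 | 12 | 65 | 50 | 39 | 65 | 55 | 155 | 49 | 210 | 104 | 4 | 2 | 3 | 8 | 2 | 11 | 5 |
| U* | 75 | 11,542 | | | | | | | | 196 | 278 | 132 | 474 | 328 | | | | | | | |
| Total | 2907 | 53,591 | | 1,059 | 2,490 | | | | 28,519 | 17,764 | 21,350 | 8,619 | 39,114 | 26,383 | 606 | 208 | | | | | |
| Avg. | 116 | 2,144 | 9 | 44 | 104 | 87 | 40 | 41 | 1,160 | 714 | 812 | 333 | 1,526 | 1,047 | 25 | 9 | 7 | 7 | 3 | 14 | 9 |

*Chromosomal assignment unknown. The U row also carries the values 1 and 479, whose columns cannot be identified from the flattened layout.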


Otto-predicted, single-exon genes were subjected to BLAST analysis against the proteins encoded by the remaining multiexon predicted transcripts. Using homology criteria of 70% sequence identity over 90% of the length, we identified 298 instances of single- to multi-exon correspondence. Of these 298 sequences, 97 were represented in the GenBank data set of experimentally validated full-length genes at the stringency specified and were verified by manual inspection.
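A threshold of this form (percent identity over a minimum fraction of the length) can be applied to parsed BLAST hits roughly as sketched below. The `Hit` record and its field names are hypothetical, and measuring coverage against the query (single-exon) protein length is an assumption, since the text does not say which sequence "the length" refers to.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    """One parsed alignment; field names are illustrative, not a BLAST API."""
    query_id: str            # single-exon predicted gene
    subject_id: str          # multiexon predicted transcript
    percent_identity: float  # 0 to 100
    align_length: int        # aligned residues on the query
    query_length: int        # full length of the query protein


def passes_homology(hit: Hit, min_identity: float = 70.0,
                    min_coverage: float = 0.90) -> bool:
    """True if the hit meets both the identity and the length-coverage cutoff."""
    coverage = hit.align_length / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def single_to_multi_exon_matches(hits: Iterable[Hit], **thresholds) -> List[str]:
    """Sorted IDs of single-exon genes with at least one qualifying hit."""
    return sorted({h.query_id for h in hits if passes_homology(h, **thresholds)})
```

Passing different values for `min_identity` and `min_coverage` lets the same filter express other cutoffs of the same shape.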

We believe that these 97 cases may represent intronless paralogs (see Web table 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known genes. Most of these are flanked by direct repeat sequences, although the precise nature of these repeats remains to be determined. All of the cases for which we have high confidence contain polyadenylated [poly(A)] tails characteristic of retrotransposition.

Recent publications describing the phenomenon of functional intronless paralogs speculate that retrotransposition may serve as a mechanism used to escape X-chromosomal inactivation (84, 86). We do not find a bias toward X chromosome origination of these retrotransposed genes; rather, the results show a random chromosome distribution of both the intron-containing and corresponding intronless paralogs. We also have found several cases of retrotransposition from a single source chromosome to multiple target chromosomes. Interesting examples include the retrotransposition of a five exon–containing ribosomal protein L21 gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and 14. The size of the source genes can also show variability. The largest example is the 31-exon diacylglycerol kinase zeta gene on chromosome 11 that has an intronless paralog on chromosome 13. Regardless of route, retrotransposition with subsequent gene changes in coding or noncoding regions that lead to different functions or expression patterns represents a key route to providing an enhanced functional repertoire in mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains a clear overrepresentation of genes involved in translational processes (40% ribosomal proteins and 10% translation elongation factors) and nuclear regulation (HMG nonhistone proteins, 4%), as well as metabolic and regulatory enzymes. EST matches specific to a subset of intronless paralogs suggest expression of these intronless paralogs. Differences in the upstream regulatory sequences between the source genes and their intronless paralogs could account for differences in tissue-specific gene expression. Defining which, if any, of these processed genes are functionally expressed and translated will require further elucidation and experimental validation.

5.2 Pseudogenes

A pseudogene is a nonfunctional copy that is very similar to a normal gene but that has been altered slightly so that it is not expressed. We developed a method for the preliminary analysis of processed pseudogenes in the human genome as a starting point in elucidating the ongoing evolutionary forces

Table 11. Genome overview.

| Feature | Value |
|---|---|
| Size of the genome (including gaps) | 2.91 Gbp |
| Size of the genome (excluding gaps) | 2.66 Gbp |
| Longest contig | 1.99 Mbp |
| Longest scaffold | 14.4 Mbp |
| Percent of A+T in the genome | 54 |
| Percent of G+C in the genome | 38 |
| Percent of undetermined bases in the genome | 9 |
| Most GC-rich 50 kb | Chr. 2 (66%) |
| Least GC-rich 50 kb | Chr. X (25%) |
| Percent of genome classified as repeats | 35 |
| Number of annotated genes | 26,383 |
| Percent of annotated genes with unknown function | 42 |
| Number of genes (hypothetical and annotated) | 39,114 |
| Percent of hypothetical and annotated genes with unknown function | 59 |
| Gene with the most exons | Titin (234 exons) |
| Average gene size | 27 kbp |
| Most gene-rich chromosome | Chr. 19 (23 genes/Mb) |
| Least gene-rich chromosomes | Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb) |
| Total size of gene deserts (>500 kb with no annotated genes) | 605 Mbp |
| Percent of base pairs spanned by genes | 25.5 to 37.8* |
| Percent of base pairs spanned by exons | 1.1 to 1.4* |
| Percent of base pairs spanned by introns | 24.4 to 36.4* |
| Percent of base pairs in intergenic DNA | 74.5 to 63.6* |
| Chromosome with highest proportion of DNA in annotated exons | Chr. 19 (9.33) |
| Chromosome with lowest proportion of DNA in annotated exons | Chr. Y (0.36) |
| Longest intergenic region (between annotated + hypothetical genes) | Chr. 13 (3,038,416 bp) |
| Rate of SNP variation | 1/1250 bp |

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical + annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated in 3-Mb windows for each chromosome. NA, not applicable.

| Chrom. | Male max. | Male avg. | Male min. | Sex-average max. | Sex-average avg. | Sex-average min. | Female max. | Female avg. | Female min. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.60 | 1.12 | 0.23 | 2.81 | 1.42 | 0.52 | 3.39 | 1.76 | 0.68 |
| 2 | 2.23 | 0.78 | 0.33 | 2.65 | 1.12 | 0.54 | 3.17 | 1.40 | 0.61 |
| 3 | 2.55 | 0.86 | 0.23 | 2.40 | 1.07 | 0.42 | 2.71 | 1.30 | 0.33 |
| 4 | 1.66 | 0.67 | 0.15 | 2.06 | 1.04 | 0.60 | 2.50 | 1.40 | 0.77 |
| 5 | 2.00 | 0.67 | 0.18 | 1.87 | 1.08 | 0.42 | 2.26 | 1.43 | 0.62 |
| 6 | 1.97 | 0.71 | 0.28 | 2.57 | 1.12 | 0.37 | 3.47 | 1.67 | 0.64 |
| 7 | 2.34 | 1.16 | 0.48 | 1.67 | 1.17 | 0.47 | 2.27 | 1.21 | 0.34 |
| 8 | 1.83 | 0.73 | 0.14 | 2.40 | 1.05 | 0.46 | 3.44 | 1.36 | 0.43 |
| 9 | 2.01 | 0.99 | 0.53 | 1.95 | 1.32 | 0.77 | 2.63 | 1.66 | 0.82 |
| 10 | 3.73 | 1.03 | 0.22 | 3.05 | 1.29 | 0.66 | 2.84 | 1.51 | 0.76 |
| 11 | 1.43 | 0.72 | 0.31 | 2.13 | 0.99 | 0.47 | 3.10 | 1.32 | 0.49 |
| 12 | 4.12 | 0.76 | 0.26 | 3.35 | 1.16 | 0.49 | 2.93 | 1.55 | 0.59 |
| 13 | 1.60 | 0.75 | 0.01 | 1.87 | 0.95 | 0.17 | 2.49 | 1.19 | 0.32 |
| 14 | 3.15 | 0.98 | 0.18 | 2.65 | 1.30 | 0.62 | 3.14 | 1.63 | 0.75 |
| 15 | 2.28 | 0.94 | 0.34 | 2.31 | 1.22 | 0.42 | 2.53 | 1.56 | 0.54 |
| 16 | 1.83 | 1.00 | 0.47 | 2.70 | 1.55 | 0.63 | 4.99 | 2.32 | 1.12 |
| 17 | 3.87 | 0.87 | 0.00 | 3.54 | 1.35 | 0.54 | 4.19 | 1.83 | 0.94 |
| 18 | 3.12 | 1.37 | 0.86 | 3.75 | 1.66 | 0.43 | 4.35 | 2.24 | 0.72 |
| 19 | 3.02 | 0.97 | 0.10 | 2.57 | 1.41 | 0.49 | 2.89 | 1.75 | 0.87 |
| 20 | 3.64 | 0.89 | 0.00 | 2.79 | 1.50 | 0.83 | 3.31 | 2.15 | 1.34 |
| 21 | 3.23 | 1.26 | 0.69 | 2.37 | 1.62 | 1.08 | 2.58 | 1.90 | 1.18 |
| 22 | 1.25 | 1.10 | 0.84 | 1.88 | 1.41 | 1.08 | 3.73 | 2.08 | 0.93 |
| X | NA | NA | NA | NA | NA | NA | 3.12 | 1.64 | 0.72 |
| Y | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Genome | 4.12 | 0.88 | 0.00 | 3.75 | 1.22 | 0.17 | 4.99 | 1.55 | 0.32 |


that account for gene inactivation. The general structural characteristics of these processed pseudogenes include the complete lack of intervening sequences found in the functional counterparts, a poly(A) tract at the 3′ end, and direct repeats flanking the pseudogene sequence. Processed pseudogenes occur as a result of retrotransposition, whereas unprocessed pseudogenes arise from segmental genome duplication.

We searched the complete set of Otto-predicted transcripts against the genomic sequence by means of BLAST. Genomic regions corresponding to all Otto-predicted transcripts were excluded from this analysis. We identified 2909 regions matching with greater than 70% identity over at least 70% of the length of the transcripts that likely represent processed pseudogenes. This number is probably an underestimate because specific methods to search for pseudogenes were not used.

We looked for correlations between structural elements and the propensity for retrotransposition in the human genome. GC content and transcript length were compared between the genes with processed pseudogenes (1177 source genes) versus the remainder of the predicted gene set. Transcripts that give rise to processed pseudogenes have shorter average transcript length (1027 bp versus 1594 bp for the Otto set) as compared with genes for which no pseudogene was detected. The overall GC content did not show any significant difference, contrary to a recent report (88). There is a clear trend in gene families that are present as processed pseudogenes. These include ribosomal proteins (67%), lamin receptors (10%), translation elongation factor alpha (5%), and HMG–non-histone proteins (2%). The increased occurrence of retrotransposition (both intronless paralogs and processed pseudogenes) among genes involved in translation and nuclear regulation may reflect an increased transcriptional activity of these genes.

5.3 Gene duplication in the human genome

Building on a previously published procedure (27), we developed a graph-theoretic algorithm, called Lek, for grouping the predicted human protein set into protein families (89).

The complete clusters that result from the Lek clustering provide one basis for comparing the role of whole-genome or chromosomal duplication in protein family expansion as opposed to other means, such as tandem duplication. Because each complete cluster represents a closed and certain island of homology, and because Lek is capable of simultaneously clustering protein complements of several organisms, the number of proteins contributed by each organism to a complete cluster can be predicted with confidence depending on the quality of the annotation of each genome. The variance of each organism’s contribution to each cluster can then be calculated, allowing an assessment of the relative importance of large-scale duplication versus smaller-scale, organism-specific expansion and contraction of protein families, presumably as a result of natural selection operating on individual protein families within an organism. As can be seen in Fig. 12, the large variance in the relative numbers of human as compared with D. melanogaster and Caenorhabditis elegans proteins in complete clusters may be explained by multiple events of relative expansions in gene families in each of the three animal genomes. Such expansions would give rise to the distribution that shows a peak at 1:1 in the ratio for human-worm or human-fly clusters with the slope spread covering both human and fly/worm predominance, as we observed (Fig. 12). Furthermore, there are nearly as many clusters where worm and fly proteins predominate despite the larger numbers of proteins in the human. At face value, this analysis suggests that natural selection acting on individual protein families has been a major force driving the expansion of at least some elements of the human protein set. However, in our analysis, the difference between an ancient whole-genome duplication followed by loss, versus piecemeal duplication, cannot be easily distinguished. In order to differentiate these scenarios, more extended analyses were performed.

5.4 Large-scale duplications

Using two independent methods, we searched for large-scale duplications in the human genome. First, we describe a protein family–based method that identified highly conserved blocks of duplication. We then describe our comprehensive method for identifying all interchromosomal block duplications. The latter method identified a large number of duplicated chromosomal segments covering parts of all 24 chromosomes.

Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

| Characteristic | Chromosome 22, Method 1 | Chromosome 22, Method 2 | Whole genome (CS assembly), Method 1 | Whole genome (CS assembly), Method 2 |
|---|---|---|---|---|
| Number of CpG islands detected | 5,211 | 522 | 195,706 | 26,876 |
| Average length of island (bp) | 390 | 535 | 395 | 497 |
| Percent of sequence predicted as CpG | 5.9 | 0.8 | 2.6 | 0.4 |
| Percent of first exons that overlap a CpG island | 44 | 25 | 42 | 22 |
| Percent of first exons with first position of exon contained inside a CpG island | 37 | 22 | 40 | 21 |
| Average distance between first exon and closest CpG island (bp) | 1,013 | 10,486 | 2,182 | 17,021 |
| Expected distance between first exon and closest CpG island (bp) | 3,262 | 32,567 | 7,164 | 55,811 |

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

| Repetitive elements | Megabases in assembled sequences | Percent of assembly | Previously predicted (%) (83) |
|---|---|---|---|
| Alu | 288 | 9.9 | 10.0 |
| Mammalian interspersed repeat (MIR) | 66 | 2.3 | 1.7 |
| Medium reiteration (MER) | 50 | 1.7 | 1.6 |
| Long terminal repeat (LTR) | 155 | 5.3 | 5.6 |
| Long interspersed nucleotide element (LINE) | 466 | 16.1 | 16.7 |
| Total | 1025 | 35.3 | 35.6 |

The first of the methods is based on the idea of searching for blocks of highly conserved homologous proteins that occur in more than one location on the genome. For this comparison, two genes were considered equivalent if their protein products were determined to be in the same family and the same complete Lek cluster (essentially paralogous genes) (89). Initially, each chromosome was represented as a string of genes ordered by the start codons for predicted genes along the chromosome. We considered the two strands as a single string, because local inversions are relatively common events relative to large-scale duplications. Each gene was indexed according to the protein family and Lek complete cluster (89). All pairs of indexed gene strings were then aligned in both the forward and reverse directions with the Smith-Waterman algorithm (90). A match between two proteins of the same Lek complete cluster was given a score of 10 and a mismatch −10, with gap open and extend penalties of −4 and −1. With these parameters, 19 conserved interchromosomal blocks of duplication were observed, all of which were also detected and expanded by the comprehensive method described below. The detection of only a relatively small number of block duplications was a consequence of using an intrinsically conservative method grounded in the conservative constraints of the complete Lek clusters.
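The scoring scheme above can be illustrated with a small local-alignment sketch over gene strings, where each chromosome is a list of cluster identifiers. This is a generic Smith-Waterman recurrence with affine gaps (Gotoh form), not the authors' implementation. It assumes that the first position of a gap costs the open penalty and each further position costs the extend penalty, which the text does not specify, and the function names are hypothetical.

```python
from typing import Hashable, Sequence


def local_alignment_score(a: Sequence[Hashable], b: Sequence[Hashable],
                          match: int = 10, mismatch: int = -10,
                          gap_open: int = -4, gap_extend: int = -1):
    """Best Smith-Waterman local alignment score between two gene strings."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of an alignment ending at (i, j).
    # E / F: best score ending with a gap that consumes b[j-1] / a[i-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(chrom_a, chrom_b, **scoring):
    """Align in both orientations and keep the better score."""
    forward = local_alignment_score(chrom_a, chrom_b, **scoring)
    reverse = local_alignment_score(chrom_a, list(reversed(chrom_b)), **scoring)
    return max(forward, reverse)
```

For example, the gene strings `[1, 2, 3, 4]` and `[1, 2, 7, 3, 4]` share four cluster matches separated by one inserted gene, so under these parameters the block scores 4 × 10 − 4 = 36.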

In the second, more comprehensive approach, we aligned all chromosomes directly with one another using an algorithm based on the MUMmer system (91). This alignment method uses a suffix tree data structure and a linear-time algorithm to align long sequences very rapidly; for example, two chromosomes of 100 Mbp can be aligned in less than 20 min (on a Compaq Alpha computer) with 4 gigabytes of memory. This procedure was used recently to identify numerous large-scale segmental duplications among the five chromosomes of A. thaliana (92); in that organism, the method revealed that 60% of the genome (66 Mbp) is covered by 24 very large duplicated segments. For Arabidopsis, a DNA-based alignment was sufficient to reveal the segmental duplications between chromosomes; in the human genome, DNA alignments at the whole-chromosome level are insufficiently sensitive. Therefore, a modified procedure was developed and applied, as follows. First, all 26,588 proteins (9,675,713 amino acids) were concatenated end-to-end in order as they occur along each of the 24 chromosomes, irrespective of strand location. The concatenated protein set was then aligned against each chromosome by the MUMmer algorithm. The resulting matches were clustered to extract all sets of three or more protein matches that occur in close proximity on two different chromosomes (93); these represent the candidate segmental duplications. A series of filters were developed and applied to remove likely false-positives from this set; for example, small blocks that were spread across many proteins were removed. To refine the filtering methods, a shuffled protein set was first created by taking the 26,588 proteins, randomizing their order, and then partitioning them into 24 shuffled chromosomes, each containing the same number of proteins as the true genome. This shuffled protein set has the identical composition to the real genome; in particular, every protein and every domain appears the same number of times. The complete algorithm was then applied to both the real and the shuffled data, with the results on the shuffled data being used to estimate the false-positive rate. The algorithm after filtering yielded 10,310 gene pairs in 1077 duplicated blocks containing 3522 distinct genes; tandemly duplicated expansions in many of the blocks explain the excess of gene pairs to distinct genes. In the shuffled data, by contrast, only 370 gene pairs were found, giving a false-positive estimate of 3.6%. The most likely explanation for the 1077 block duplications is ancient segmental duplications. In many cases, the order of the proteins has been shuffled, although proximity is preserved. Out of the 1077 blocks, 159 contain only three genes, 137 contain four genes, and 781 contain five or more genes.
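The shuffled control and the resulting false-positive estimate can be sketched as follows. The function names and the fixed random seed are assumptions for illustration; the block-detection step itself is not reproduced here.

```python
import random
from typing import List, Sequence


def shuffle_into_chromosomes(chromosomes: Sequence[Sequence[str]],
                             seed: int = 0) -> List[List[str]]:
    """Randomize protein order genome-wide, then re-partition the proteins so
    that each shuffled chromosome holds as many proteins as the real one."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    pool = [protein for chrom in chromosomes for protein in chrom]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for chrom in chromosomes:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


def false_positive_estimate(pairs_in_shuffled: int, pairs_in_real: int) -> float:
    """Fraction of gene pairs expected by chance, using the shuffled genome
    as the null model."""
    return pairs_in_shuffled / pairs_in_real
```

With the counts reported above, `false_positive_estimate(370, 10310)` is about 0.036, which matches the 3.6% quoted in the text.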

To illustrate the extent of the detected duplications, Fig. 13 shows all 1077 block duplications indexed to each chromosome in 24 panels in which only duplications mapped to the indexed chromosome are displayed. The figure makes it clear that the duplications are ubiquitous in the genome. One feature that it displays is many relatively small chromosomal stretches, with one-to-many duplication relationships that are graphically striking. One such example captured by the analysis is the well-documented olfactory receptor (OR) family, which is scattered in blocks throughout the genome and which has been analyzed for genome-deployment reconstructions at several evolutionary stages (94). The figure also illustrates that some chromosomes, such as chromosome 2, contain many more detected large-scale duplications than others. Indeed, one of the largest duplicated segments is a large block of 33 proteins on chromosome 2, spread among eight smaller blocks in 2p, that aligns to a paralogous set on chromosome 14, with one rearrangement (see chromosomes 2 and 14 panels in Fig. 13). The proteins are not contiguous but span a region containing 97 proteins on chromosome 2 and 332 proteins on chromosome 14. The likelihood of observing this many duplicated proteins by chance, even over a span of this length, is 2.3 × 10⁻⁶⁸ (93). This duplicated set spans 20 Mbp on chromosome 2 and 63 Mbp on chromosome 14, over 70% of the latter chromosome. Chromosome 2 also contains a block duplication that is nearly as large, which is shared by chromosome arm 2q and chromosome 12. This duplication incorporates two of the four known Hox gene clusters, but considerably expands the extent of the duplications proximally and distally on the pair of chromosome arms. This breadth of duplication is also seen on the two chromosomes carrying the other two Hox clusters.

An additional large duplication, between chromosomes 18 and 20, serves as a good example to illustrate some of the features common to many of the other observed large duplications (Fig. 13, inset). This duplication contains 64 detected ordered interchromosomal pairs of homologous genes. After discounting a 40-Mb stretch of chromosome 18 free of matches to chromosome 20, which is likely to represent a large insert (between the gene assignments “Krup rel” and “collagen rel” on chromosome 18 in Fig. 13), the full duplication segment covers 36 Mb on chromosome 18 and 28 Mb on chromosome 20.

Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm, and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole number) of human versus worm and human versus fly proteins per cluster were plotted.


By this measure, the duplication segment spans nearly half of each chromosome’s net length. The most likely scenario is that the whole span of this region was duplicated as a single very large block, followed by shuffling owing to smaller scale rearrangements. As such, at least four subsequent rearrangements would need to be invoked to explain the relative insertions and inversions seen in the duplicated segment interval. The 64 protein pairs in this alignment occur among 217 protein assignments on chromosome 18, and among 322 protein assignments on chromosome 20, for a density of involved proteins of 20 to 30%. This is consistent with an ancient large-scale duplication followed by subsequent gene loss on one or both chromosomes. Loss of just one member of a gene pair subsequent to the duplication would result in a failure to score a gene pair in the block; less than 50% gene loss on the chromosomes would lead to the duplication density observed here. As an independent verification of the significance of the alignments detected, it can be seen that a substantial number of the pairs of aligning proteins in this duplication, including some of those annotated (Fig. 13), are those populating small Lek complete clusters (see above). This indicates that they are members of very small families of paralogs; their relative scarcity within the genome validates the uniqueness and robust nature of their alignments.

Two additional qualitative features were observed among many of the large-scale duplications. First, several proteins with disease associations, with OMIM (Online Mendelian Inheritance in Man) assignments, are members of duplicated segments (see web table 2 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also observed a few instances where paralogs on both duplicated segments are associated with similar disease conditions. Notable among these genes are proteins involved in hemostasis (coagulation factors) that are associated with bleeding disorders, transcriptional regulators like the homeobox proteins associated with developmental disorders, and potassium channels associated with cardiovascular conduction abnormalities. For each of these disease genes, closer study of the paralogous genes in the duplicated segment may reveal new insights into disease causation, with further investigation needed to determine whether they might be involved in the same or similar genetic diseases. Second, although there is a conserved number of proteins and coding exons predicted for specific large duplicated spans within the chromosome 18 to 20 alignment, the genomic DNA of chromosome 18 in these specific spans is in some cases more than 10-fold longer than the corresponding chromosome 20 DNA. This selective accretion of noncoding DNA (or conversely, loss of noncoding DNA) on one of a pair of duplicated chromosome regions was observed in many compared regions. Hypotheses to explain which mechanisms foster these processes must be tested.

Evaluation of the alignment results gives some perspective on dating of the duplications. As noted above, large-scale ancient segmental duplication in fact best explains many of the blocks detected by this genome-wide analysis. The regions of human chromosomes involved in the large-scale duplications expanded upon above (chromosomes 2 to 14, 2 to 12, and 18 to 20) are each syntenic to a distinct mouse chromosomal region. The corresponding mouse chromosomal regions are much more similar in sequence conservation, and even in order, to their human synteny partners than the human duplication regions are to each other. Further, the corresponding mouse chromosomal regions each bear a significant proportion of genes orthologous to the human genes on which the human duplication assignments were made. On the basis of these factors, the corresponding mouse chromosomal spans, at coarse resolution, appear to be products of the same large-scale duplications observed in humans. Although further detailed analysis must be carried out once a more complete genome is assembled for mouse, the underlying large duplications appear to predate the two species’ divergence. This dates the duplications, at the latest, before divergence of the primate and rodent lineages. This date can be further refined upon examination of the synteny between human chromosomes and those of chicken, pufferfish (Fugu rubripes), or zebrafish (95). The only substantial syntenic stretches mapped in these species corresponding to both pairs of human duplications are restricted to the Hox cluster regions. When the synteny of these regions (or others) to human chromosomes is extended with further mapping, the ages of the nearly chromosome-length duplications seen in humans are likely to be dated to the root of vertebrate divergence.

The MUMmer-based results demonstrate large block duplications that range in size from a few genes to segments covering most of a chromosome. The extent of segmental duplications raises the question of whether an ancient whole-genome duplication event is the underlying explanation for the numerous duplicated regions (96). The duplications have undergone many deletions and subsequent rearrangements; these events make it difficult to distinguish between a whole-genome duplication and multiple smaller events. Further analysis, focused especially on comparing the estimated ages of all the block duplications, derived partially from interspecies genome comparisons, will be necessary to determine which of these two hypotheses is more likely. Comparisons of genomes of different vertebrates, and even cross-phyla genome comparisons, will allow for the deconvolution of duplications to eventually reveal the stagewise history of our genome, and with it a history of the emergence of many of the key functions that distinguish us from other living things.

6 A Genome-Wide Examination of Sequence Variations

Summary. Computational methods were used to identify single-nucleotide polymorphisms (SNPs) by comparison of the Celera sequence to other SNP resources. The SNP rate between two chromosomes was ~1 per 1200 to 1500 bp. SNPs are distributed nonrandomly throughout the genome. Only a very small proportion of all SNPs (<1%) potentially impact protein function based on the functional analysis of SNPs that affect the predicted coding regions. This results in an estimate that only thousands, not millions, of genetic variations may contribute to the structural diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a dramatic acceleration in the rate of gene discovery, but only through analysis of sequence variation in DNA can we discover the genetic basis for variation in health among human beings. Whole-genome shotgun sequencing is a particularly effective method for detecting sequence variation in tandem with whole-genome assembly. In addition, we compared the distribution and attributes of SNPs ascertained by three other methods: (i) alignment of the Celera consensus sequence to the PFP assembly, (ii) overlap of high-quality reads of genomic sequence (referred to as “Kwok”; 1,120,195 SNPs) (97), and (iii) reduced representation shotgun sequencing (referred to as “TSC”; 632,640 SNPs) (98). These data were consistent in showing an overall nucleotide diversity of ~8 × 10⁻⁴, marked heterogeneity across the genome in SNP density, and an overwhelming preponderance of noncoding variation that produces no change in expressed proteins.

6.1 SNPs found by aligning the Celera consensus to the PFP assembly

Ideally, methods of SNP discovery make full use of sequence depth and quality at every site, and quantitatively control the rate of false-positive and false-negative calls with an explicit sampling model (99). Comparison of consensus sequences in the absence of these details necessitated a more ad hoc approach (quality scores could not readily be obtained for the PFP assembly). First, all sequence differences between the two consensus sequences were identified; these were then filtered to reduce the contribution of sequencing errors and misassembly. As a measure of the effectiveness of the filtering step, we monitored the ratio of transition and transversion substitutions, because a 2:1 ratio has been well documented as typical in mammalian evolution (100) and in human SNPs (101, 102). The filtering steps consisted of removing variants where the quality score in the Celera consensus was less than 30 and where the density of variants was greater than 5 in 400 bp. These filters resulted in shifting the transition-to-transversion ratio from 1.57:1 to 1.89:1. When applied to 2.3 Gbp of alignments between the Celera and PFP consensus sequences, these filters resulted in identification of 2,104,820 putative SNPs from a total of 2,778,474 substitution differences. Overlaps between this set of SNPs and those found by other methods are described below.
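The two filters and the transition-to-transversion check described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the Celera-PFP comparison: the tuple layout, the function names, and the choice to count neighbouring variants in a 400-bp window centred on each variant are all assumptions made for the example, and the candidate list is invented.

```python
from bisect import bisect_left

# The two transition classes (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def ti_tv_ratio(variants):
    """Return transitions / transversions for (pos, ref, alt, quality) tuples."""
    ti = sum(1 for _, ref, alt, _ in variants
             if frozenset((ref, alt)) in TRANSITIONS)
    tv = len(variants) - ti
    return ti / tv if tv else float("inf")


def filter_variants(variants, min_quality=30, max_in_window=5, window=400):
    """Drop low-quality variants and variants that sit in dense clusters.

    `variants` must be sorted by position. The window is centred on each
    variant, which is an assumption: the text only states a threshold of
    more than 5 variants in 400 bp.
    """
    positions = [v[0] for v in variants]
    half = window // 2
    kept = []
    for pos, ref, alt, quality in variants:
        if quality < min_quality:
            continue  # quality score below the threshold of 30
        lo = bisect_left(positions, pos - half)
        hi = bisect_left(positions, pos + half)
        if hi - lo > max_in_window:
            continue  # more than 5 candidate differences within 400 bp
        kept.append((pos, ref, alt, quality))
    return kept


# Hypothetical candidate differences, for illustration only.
candidates = [
    (100, "A", "G", 40),    # isolated, good quality
    (1000, "C", "T", 20),   # low quality
    (5000, "A", "C", 45),   # start of a dense cluster of six
    (5010, "A", "T", 45),
    (5020, "C", "G", 45),
    (5030, "G", "T", 45),
    (5040, "A", "C", 45),
    (5050, "A", "G", 45),
    (9000, "C", "T", 50),   # isolated, good quality
    (12000, "G", "T", 50),  # isolated, good quality
]
filtered = filter_variants(candidates)
print(ti_tv_ratio(candidates), ti_tv_ratio(filtered))
```

In this toy input the dense cluster is mostly transversions, so removing it moves the ratio toward 2:1, which is the kind of shift the text uses as a sign that the filters are removing errors rather than real SNPs.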

6.2 Comparisons to public SNP databases

Additional SNPs, including 2,536,021 from dbSNP (www.ncbi.nlm.nih.gov/SNP) and 13,150 from HGMD (Human Gene Mutation Database, from the University of Wales, UK), were mapped on the Celera consensus sequence by a sequence similarity search with the program PowerBlast (103). The two largest data sets in dbSNP are the Kwok and TSC sets, with 47% and 25% of the dbSNP records. Low-quality alignments with partial coverage of the dbSNP sequence and alignments that had less than 98% sequence identity between the Celera sequence and the dbSNP flanking sequence were eliminated. dbSNP sequences mapping to multiple locations on the Celera genome were discarded. A total of 2,336,935 dbSNP variants were mapped to 1,223,038 unique locations on the Celera sequence, implying considerable redundancy in dbSNP. SNPs in the TSC set mapped to 585,811 unique genomic locations, and SNPs in the Kwok set mapped to 438,032 unique locations. The combined count of unique SNPs used in this analysis, including Celera-PFP, TSC, and Kwok, is 2,737,668. Table 15 shows that a substantial fraction of SNPs identified by one of these methods was also found by another method. The very high overlap (36.2%) between the Kwok and Celera-PFP SNPs may be due in part to the use by Kwok of sequences that went into the PFP assembly. The unusually low overlap (16.4%) between the Kwok and TSC sets is due to their being the smallest two sets. In addition, 24.5% of the Celera-PFP SNPs overlap with SNPs derived from the Celera genome sequences (46). SNP validation in population samples is an expensive and laborious process, so confirmation on multiple data sets may provide an efficient initial validation “in silico” (by computational analysis).
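The overlap fractions quoted above follow the definition given with Table 15: the number of shared SNPs divided by the size of the smaller of the two data sets. The short Python sketch below reproduces the three published fractions from the published counts. The function names are illustrative, and keying a SNP by a (chromosome, position) pair is an assumption about how mapped locations might be compared.

```python
def overlap_fraction(shared, size_a, size_b):
    """Shared SNP count divided by the size of the smaller data set."""
    smaller = min(size_a, size_b)
    if smaller <= 0:
        raise ValueError("both data sets must be non-empty")
    return shared / smaller


def shared_snps(snps_a, snps_b):
    """Count SNPs present in both collections of (chromosome, position) keys."""
    return len(set(snps_a) & set(snps_b))


# Counts taken from Table 15.
SIZES = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}
SHARED = {
    ("Celera-PFP", "TSC"): 188_694,
    ("Celera-PFP", "Kwok"): 158_532,
    ("TSC", "Kwok"): 72_024,
}

fractions = {
    pair: round(overlap_fraction(count, SIZES[pair[0]], SIZES[pair[1]]), 3)
    for pair, count in SHARED.items()
}
print(fractions)
```

Dividing by the smaller set keeps the fraction comparable across pairs of very different size, but it also means a pair of two small sets has less opportunity to overlap, which is the explanation the text offers for the Kwok-TSC value.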

One means of assessing whether the three sets of SNPs provide the same picture of human variation is to tally the frequencies of the six possible base changes in each set of SNPs (Table 16). Previous measures of nucleotide diversity were mostly derived from small-scale analysis on candidate genes (101), and our analysis with all three data sets validates the previous observations at the whole-genome scale. There is remarkable homogeneity between the SNPs found in the Kwok set, the TSC set, and in our whole-genome shotgun (46) in this substitution pattern. Compared with the rest of the data sets, Celera-PFP deviates slightly from the 2:1 transition-to-transversion ratio observed in the other SNP sets. This result is not unexpected, because some fraction of the computationally identified SNPs in the Celera-PFP comparison may in fact be sequence errors. A 2:1 transition:transversion ratio for the bona fide SNPs would be obtained if one assumed that 15% of the sequence differences in the Celera-PFP set were a result of (presumably random) sequence errors.
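The arithmetic behind the error estimate can be made explicit with a simple mixture model. The sketch below assumes that a random sequencing error is equally likely to produce any of the three possible wrong bases, so that errors have a transition:transversion ratio of 1:2; that reading of "presumably random" is our assumption, as are the function names. The observed ratio of 1.59:1 is the Celera-PFP value in Table 16.

```python
def transition_fraction(ratio):
    """Convert a transition:transversion ratio into the fraction of transitions."""
    return ratio / (1.0 + ratio)


def implied_error_fraction(observed_ratio, true_ratio=2.0, error_ratio=0.5):
    """Fraction of differences that must be errors to explain the observed ratio.

    Model: observed = (1 - e) * true SNPs + e * errors, mixed at the level of
    transition fractions. `error_ratio=0.5` encodes the assumption that a
    random error yields one transition for every two transversions.
    """
    f_obs = transition_fraction(observed_ratio)
    f_true = transition_fraction(true_ratio)
    f_err = transition_fraction(error_ratio)
    return (f_true - f_obs) / (f_true - f_err)


error_fraction = implied_error_fraction(1.59)
print(f"implied error fraction: {error_fraction:.3f}")
```

Under these assumptions the implied error fraction comes out at roughly 16%, close to the figure of 15% quoted in the text.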

6.3 Estimation of nucleotide diversity from ascertained SNPs

The number of SNPs identified varied widely across chromosomes. In order to normalize these values to the chromosome size and sequence coverage, we used π, the standard statistic for nucleotide diversity (104). Nucleotide diversity is a measure of per-site heterozygosity, quantifying the probability that a pair of chromosomes drawn from the population will differ at a nucleotide site. In order to calculate nucleotide diversity for each chromosome, we need to know the number of nucleotide sites that were surveyed for variation, and in methods like reduced representation sequencing, we need to know the sequence quality and the depth of coverage at each site. These data are not readily available, so we could not estimate nucleotide diversity from the TSC effort. Estimation of nucleotide diversity from high-quality sequence overlaps should be possible, but again, more information is needed on the details of all the alignments.
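As a concrete reading of the definition above, nucleotide diversity can be computed from a set of aligned sequences as the average number of pairwise differences per site. The Python sketch below is a textbook-style illustration on an invented toy alignment. It assumes every site was surveyed in every sequence, and it does not attempt the coverage and quality corrections that the text says are needed for real shotgun data.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average number of pairwise differences per site (pi).

    `sequences` is a list of equal-length, already aligned strings, one per
    sampled chromosome. Every site is assumed to have been surveyed in every
    sequence.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    total_differences = 0
    n_pairs = 0
    for first, second in combinations(sequences, 2):
        total_differences += sum(a != b for a, b in zip(first, second))
        n_pairs += 1
    return total_differences / (n_pairs * length)


# Hypothetical toy alignment: four chromosomes, ten sites.
sample = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
pi = nucleotide_diversity(sample)
print(pi)
```

The denominator is where the difficulty described in the text arises: without knowing how many sites were actually surveyed, and how reliably, the number of pairs times the number of sites cannot be filled in.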

Estimation of nucleotide diversity from a shotgun assembly entails calculating, for each column of the multialignment, the probability that two or more distinct alleles are present, and the probability of detecting a SNP if in fact the alleles have different sequence (i.e., the probability of correct sequence calls). The greater the depth of coverage and the higher the sequence quality, the higher is the chance of successfully detecting a SNP (105). Even after correcting for variation in coverage, the nucleotide diversity appeared to vary across autosomes. The significance of this heterogeneity was tested by analysis of variance, with estimates of π for 100-kbp windows to estimate variability within chromosomes (for the Celera-PFP comparison, F = 29.73, P < 0.0001).

Average diversity for the autosomes estimated from the Celera-PFP comparison was 8.94 × 10⁻⁴. Nucleotide diversity on the X chromosome was 6.54 × 10⁻⁴. The X is expected to be less variable than autosomes, because for every four copies of autosomes in the population, there are only three X chromosomes, and this smaller effective population size means that random drift will more rapidly remove variation from the X (106).
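The three-to-four argument gives a simple numerical expectation that can be set against the two estimates above. The sketch below assumes equal numbers of males and females and equal mutation rates on the X and the autosomes; both are simplifying assumptions of the example, not claims made in the text.

```python
AUTOSOME_PI = 8.94e-4  # Celera-PFP autosomal average, from the text
X_PI = 6.54e-4         # Celera-PFP X chromosome estimate, from the text


def expected_x_to_autosome_ratio(females=1, males=1):
    """Ratio of X copies to autosome copies for a given sex ratio."""
    x_copies = 2 * females + males          # XX females, XY males
    autosome_copies = 2 * (females + males)
    return x_copies / autosome_copies


expected_ratio = expected_x_to_autosome_ratio()
observed_ratio = X_PI / AUTOSOME_PI
print(f"expected {expected_ratio:.2f}, observed {observed_ratio:.3f}")
```

The observed ratio is a little below the simple expectation of 0.75, which is in the direction the copy-number argument predicts.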

Having ascertained nucleotide variation genome-wide, it appears that previous estimates of nucleotide diversity in humans based on samples of genes were reasonably accurate (101, 102, 106, 107). Genome-wide, our estimate of nucleotide diversity was 8.98 × 10⁻⁴ for the Celera-PFP alignment, and a published estimate averaged over 10 densely resequenced human genes was 8.00 × 10⁻⁴ (108).

Table 15. Overlap of SNPs from genome-wide SNP databases. Table entries are SNP counts for each pair of data sets. Numbers in parentheses are the fraction of overlap, calculated as the count of overlapping SNPs divided by the number of SNPs in the smaller of the two databases compared. Total SNP counts for the databases are: Celera-PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032. Only unique SNPs in the TSC and Kwok data sets were included.

              TSC                Kwok
Celera-PFP    188,694 (0.322)    158,532 (0.362)
TSC                              72,024 (0.164)

Table 16. Summary of nucleotide changes in different SNP data sets.

SNP data set   A/G (%)   C/T (%)   A/C (%)   A/T (%)   C/G (%)   T/G (%)   Transition:transversion
Celera-PFP     30.7      30.7      10.3      8.6       9.2       10.3      1.59:1
Kwok*          33.7      33.8      8.5       7.0       8.6       8.4       2.07:1
TSC†           33.3      33.4      8.8       7.3       8.6       8.6       1.99:1

*November 2000 release of the NCBI database dbSNP (www.nci.nlm.nih.gov/SNP/) with the method defined as OverlapSnpDetectionWithPolyBayes. The submitter of the data is Pui-Yan Kwok from Washington University. †November 2000 release of NCBI dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the methods defined as TSC-Sanger, TSC-WICGR, and TSC-WUGSC. The submitter of the data is Lincoln Stein from Cold Spring Harbor Laboratory.

Fig. 13. Segmental duplications between chromosomes in the human genome. The 24 panels show the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line represents a pair of homologous genes belonging to a block; all blocks contain at least three genes on each of the chromosomes where they appear. Each panel shows all the duplications between a single chromosome and other chromosomes with shared blocks. The chromosome at the center of each panel is shown as a thick red line for emphasis. Other chromosomes are displayed from top to bottom within each panel ordered by chromosome number. The inset (bottom, center right) shows a close-up of one duplication between chromosomes 18 and 20, expanded to display the gene names of 12 of the 64 gene pairs shown.

6.4 Variation in nucleotide diversity across the human genome

Such an apparently high degree of variability among chromosomes in SNP density raises the question of whether there is heterogeneity at a finer scale within chromosomes, and whether this heterogeneity is greater than expected by chance. If SNPs occur by random and independent mutations, then it would seem that there ought to be a Poisson distribution of numbers of SNPs in fragments of arbitrary constant size. The observed dispersion in the distribution of SNPs in 100-kbp fragments was far greater than predicted from a Poisson distribution (Fig. 14). However, this simplistic model ignores the different recombination rates and population histories that exist in different regions of the genome. Population genetics theory holds that we can account for this variation with a mathematical formulation called the neutral coalescent (109). Applying well-tested algorithms for simulating the neutral coalescent with recombination (110), and using an effective population size of 10,000 and a per-base recombination rate equal to the mutation rate (111), we generated a distribution of numbers of SNPs by this model as well (112). The observed distribution of SNPs has a much larger variance than either the Poisson model or the coalescent model, and the difference is highly significant. This implies that there is significant variability across the genome in SNP density, an observation that begs an explanation.
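The Poisson comparison rests on a simple property: for a Poisson distribution the variance of the counts equals their mean, so the ratio of variance to mean (the index of dispersion) should be close to 1. The Python sketch below bins SNP positions into fixed windows and computes that ratio. It illustrates only the Poisson baseline, not the coalescent simulation; the function names, the 0-based position convention, and the example counts are all assumptions.

```python
from statistics import mean, variance


def counts_per_window(positions, genome_length, window=100_000):
    """Count SNPs in consecutive fixed-size windows (positions are 0-based)."""
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def dispersion_index(counts):
    """Sample variance divided by mean; about 1 for Poisson-distributed counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("no SNPs in any window")
    return variance(counts) / m


# Hypothetical window counts, for illustration only.
even_counts = [2, 4, 4, 4, 5, 5, 7, 9]
clustered_counts = [0, 0, 1, 50, 2, 120, 3, 0]
print(dispersion_index(even_counts), dispersion_index(clustered_counts))
```

An index far above 1, as in the clustered example, is the signature of overdispersion that the text reports for the observed 100-kbp window counts.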

Several attributes of the DNA sequence may affect the local density of SNPs, including the rate at which DNA polymerase makes errors and the efficacy of mismatch repair. One key factor that is likely to be associated with SNP density is the G+C content, in part because methylated cytosines in CpG dinucleotides tend to undergo deamination to form thymine, accounting for a nearly 10-fold increase in the mutation rate of CpGs over other dinucleotides. We tallied the GC content and nucleotide diversities in 100-kbp windows across the entire genome and found that the correlation between them was positive (r = 0.21) and highly significant (P < 0.0001), but G+C content accounted for only a small part of the variation.
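One way to read the last sentence is through the squared correlation coefficient, which in a simple linear model gives the fraction of variance in one variable accounted for by the other. The sketch below computes a Pearson correlation for paired window values and squares the reported r. Whether the authors quantified "a small part" in exactly this way is not stated, so this is an interpretation, and the function names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def variance_explained(r):
    """Fraction of variance accounted for under a simple linear model."""
    return r * r


REPORTED_R = 0.21  # correlation between GC content and diversity, from the text
print(f"{variance_explained(REPORTED_R):.4f}")
```

With r = 0.21 the squared value is about 0.044, so on this reading G+C content would account for under 5% of the window-to-window variance in diversity.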

6.5 SNPs by genomic class

To test homogeneity of SNP densities across functional classes, we partitioned sites into intergenic (defined as >5 kbp from any predicted transcription unit), 5′-UTR, exonic (missense and silent), intronic, and 3′-UTR for 10,239 known genes, derived from the NCBI RefSeq database and all human genes predicted from the Celera Otto annotation. In coding regions, SNPs were categorized as either silent, for those that do not change amino acid sequence, or missense, for those that change the protein product. The ratio of missense to silent coding SNPs in Celera-PFP, TSC, and Kwok sets (1.12, 0.91, and 0.78, respectively) shows a markedly reduced frequency of missense variants compared with the neutral expectation, consistent with the elimination by natural selection of a fraction of the deleterious amino acid changes (112). These ratios are comparable to the missense-to-silent ratios of 0.88 and 1.17 found by Cargill et al. (101) and by Halushka et al. (102). Similar results were observed in SNPs derived from Celera shotgun sequences (46).
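The silent/missense split described above amounts to translating the reference and variant codons and comparing the amino acids. The sketch below does this with the standard genetic code. It is an illustration of the classification idea only: the function names are ours, the SNP is given as a codon plus an offset within it, and changes to or from a stop codon are lumped in with "missense" because the text describes only a two-way split.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, offset, alt_base):
    """Return 'silent' or 'missense' for a substitution within one codon.

    `offset` is the 0-based position of the SNP inside `codon`. Stop-codon
    changes are reported as 'missense' here, a simplification that follows
    the two-way split in the text.
    """
    codon = codon.upper()
    mutated = codon[:offset] + alt_base.upper() + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "silent"
    return "missense"


def missense_to_silent_ratio(labels):
    """Ratio of missense to silent calls in a list of labels."""
    silent = labels.count("silent")
    if silent == 0:
        raise ValueError("no silent SNPs")
    return labels.count("missense") / silent


# Hypothetical coding SNPs: (reference codon, offset, alternate base).
examples = [("GAA", 2, "G"), ("GAA", 0, "A"), ("CTT", 2, "C")]
labels = [classify_coding_snp(*snp) for snp in examples]
print(labels, missense_to_silent_ratio(labels))
```

Because most third-position changes are silent while most first- and second-position changes are not, random mutation alone would produce more missense than silent changes; that is why observed ratios near 1 point to selection against amino acid changes.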

It is striking how small is the fraction of SNPs that lead to potentially dysfunctional alterations in proteins. In the 10,239 RefSeq genes, missense SNPs were only about 0.12, 0.14, and 0.17% of the total SNP counts in Celera-PFP, TSC, and Kwok SNPs, respectively. Nonconservative protein changes constitute an even smaller fraction of missense SNPs (47, 41, and 40% in Celera-PFP, Kwok, and TSC). Intergenic regions have been virtually unstudied (113), and we note that 75% of the SNPs we identified were intergenic (Table 17). The SNP rate was highest in introns and lowest in exons. The SNP rate was lower in intergenic regions than in introns, providing one of the first discriminators between these two classes of DNA. These SNP rates were confirmed in the Celera SNPs, which also exhibited a lower rate in exons than in introns, and in extragenic regions than in introns (46). Many of these intergenic SNPs will provide valuable information in the form of markers for linkage and association studies, and some fraction is likely to have a regulatory function as well.

7 An Overview of the Predicted Protein-Coding Genes in the Human Genome

Summary. This section provides an initial computational analysis of the predicted protein set with the aim of cataloging prominent differences and similarities when the human genome is compared with other fully sequenced eukaryotic genomes. Over 40% of the predicted protein set in humans cannot be ascribed a molecular function by methods that assign proteins to known families. A protein domain-based analysis provides a detailed catalog of the prominent differences in the human genome when compared with the fly and worm genomes. Prominent among these are domain expansions in proteins involved in developmental regulation and in cellular processes such as neuronal function, hemostasis, acquired immune response, and cytoskeletal complexity. The final enumeration of protein families and details of protein structure will rely on additional experimental work and comprehensive manual curation.

A preliminary analysis of the predicted human protein-coding genes was conducted. Two methods were used to analyze and classify the molecular functions of 26,588 predicted proteins that represent 26,383 gene predictions with at least two lines of evidence as described above. The first method was based on an analysis at the level of protein families, with both the publicly available Pfam database (114, 115) and Celera’s Panther Classification (CPC) (Fig. 15) (116). The second method was based on an analysis at the level of protein domains, with both the Pfam and SMART databases (115, 117).

The results presented here are preliminary and are subject to several limitations.

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely accounted for by a coalescent model of regional history.


Both the gene predictions and functional assignments have been made by using computational tools, although the statistical models in Panther, Pfam, and SMART have been built, annotated, and reviewed by expert biologists. In the set of computationally predicted genes, we expect both false-positive predictions (some of these may in fact be inactive pseudogenes) and false-negative predictions (some human genes will not be computationally predicted). We also expect errors in delimiting the boundaries of exons and genes. Similarly, in the automatic functional assignments, we also expect both false-positive and false-negative predictions. The functional assignment protocol focuses on protein families that tend to be found across several organisms, or on families of known human genes. Therefore, we do not assign a function to many genes that are not in large families, even if the function is known. Unless otherwise specified, all enumeration of the genes in any given family or functional category was taken from the set of 26,588 predicted proteins, which were assigned functions by using statistical score cutoffs defined for models in Panther, Pfam, and SMART.

For this initial examination of the predicted human protein set, three broad questions were asked: (i) What are the likely molecular functions of the predicted gene products, and how are these proteins categorized with current classification methods? (ii) What are the core functions that appear to be common across the animals? (iii) How does the human protein complement differ from that of other sequenced eukaryotes?

7.1 Molecular functions of predicted human proteins

Figure 15 shows an overview of the putative molecular functions of the predicted 26,588 human proteins that have at least two lines of supporting evidence. About 41% (12,809) of the gene products could not be classified from this initial analysis and are termed proteins with unknown functions. Because our automatic classification methods treat only relatively large protein families, there are a number of “unclassified” sequences that do, in fact, have a known or predicted function. For the 60% of the protein set that have automatic functional predictions, the specific protein functions have been placed into broad classes. We focus here on molecular function (rather than higher order cellular processes) in order to classify as many proteins as possible. These functional predictions are based on similarity to sequences of known function.

In our analysis of the 12,731 additional low-confidence predicted genes (those with only one piece of supporting evidence), only 636 (5%) of these additional putative genes were assigned molecular functions by the automated methods. One-third of these 636 predicted genes represented endogenous retroviral proteins, further suggesting that the majority of these unknown-function genes are not real genes. Given that most of these additional 12,095 genes appear to be unique among the genomes sequenced to date, many may simply represent false-positive gene predictions.

The most common molecular functions are the transcription factors and those involved in nucleic acid metabolism (nucleic acid enzyme). Other functions that are highly represented in the human genome are the receptors, kinases, and hydrolases. Not surprisingly, most of the hydrolases are proteases. There are also many proteins that are members of proto-oncogene families, as well as families of “select regulatory molecules”: (i) proteins involved in specific steps of signal transduction such as heterotrimeric GTP-binding proteins (G proteins) and cell cycle regulators, and (ii) proteins that modulate the activity of kinases, G proteins, and phosphatases.

Fig. 15. Distribution of the molecular functions of 26,383 human genes. Each slice lists the numbers and percentages (in parentheses) of human gene functions assigned to a given category of molecular function. The outer circle shows the assignment to molecular function categories in the Gene Ontology (GO) (179), and the inner circle shows the assignment to Celera’s Panther molecular function categories (116).

Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class    Size of region examined (Mb)    Celera-PFP SNP density (SNP/Mb)
Intergenic              2185                            707
Gene (intron + exon)    646                             917
Intron                  615                             921
First intron            164                             808
Exon                    31                              529
First exon              10                              592


7.2 Evolutionary conservation of core processes

Because of the various “model organism” genome-sequencing projects that have already been completed, reasonable comparative information is available for beginning the analysis of the evolution of the human genome. The genomes of S. cerevisiae (“bakers’ yeast”) (118) and two diverse invertebrates, C. elegans (a nematode worm) (119) and D. melanogaster (fly) (26), as well as the first plant genome, A. thaliana, recently completed (92), provide a diverse background for genome comparisons.

We enumerated the “strict orthologs” conserved between human and fly, and between human and worm (Fig. 16) to address the question, What are the core functions that appear to be common across the animals? The concept of orthology is important because if two genes are orthologs, they can be traced by descent to the common ancestor of the two organisms (an “evolutionarily conserved protein set”), and therefore are likely to perform similar conserved functions in the different organisms. It is critical in this analysis to separate orthologs (a gene that appears in two organisms by descent from a common ancestor) from paralogs (a gene that appears in more than one copy in a given organism by a duplication event) because paralogs may subsequently diverge in function. Following the yeast-worm ortholog comparison in (120), we identified two different cases for each pairwise comparison (human-fly and human-worm). The first case was a pair of genes, one from each organism, for which there was no other close homolog in either organism. These are straightforwardly identified as orthologous, because there are no additional members of the families that complicate separating orthologs from paralogs. The second case is a family of genes with more than one member in either or both of the organisms being compared. Chervitz et al. (120) dealt with this case by analyzing a phylogenetic tree that described the relationships between all of the sequences in both organisms, and then looking for pairs of genes that were nearest neighbors in the tree. If the nearest-neighbor pairs were from different organisms, those genes were presumed to be orthologs. We note that these nearest neighbors can often be confidently identified from pairwise sequence comparison without having to examine a phylogenetic tree (see legend to Fig. 16). If the nearest neighbors are not from different organisms, there has been a paralogous expansion in one or both organisms after the speciation event (and/or a gene loss by one organism). When this one-to-one correspondence is lost, defining an ortholog becomes ambiguous. For our initial computational overview of the predicted human protein set, we could not answer this question for every predicted protein. Therefore, we consider only “strict orthologs,” i.e., the proteins with unambiguous one-to-one relationships (Fig. 16). By these criteria, there are 2758 strict human-fly orthologs and 2031 strict human-worm orthologs (1523 in common between these sets). We define the evolutionarily conserved set as those 1523 human proteins that have strict orthologs in both D. melanogaster and C. elegans.
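The strict-ortholog criteria given in the legend to Fig. 16 (bidirectional best hits, a P-value threshold, and no more significant paralog in either organism) can be sketched as follows. This is a hypothetical illustration rather than the procedure actually run: the nested-dictionary input format, the function name, the protein identifiers, and the use of P-values as the measure of significance in the paralog check are all assumptions.

```python
def strict_orthologs(a_vs_b, b_vs_a, a_vs_a, b_vs_b, max_p=1e-10):
    """Return sorted (a, b) pairs that satisfy the strict-ortholog criteria.

    Each argument maps a query protein to {subject protein: P-value}.
    `a_vs_b` and `b_vs_a` hold cross-species hits; `a_vs_a` and `b_vs_b`
    hold within-species hits, used to detect closer paralogs.
    """
    pairs = []
    for a, hits in a_vs_b.items():
        if not hits:
            continue
        b = min(hits, key=hits.get)          # best hit of a in organism B
        p = hits[b]
        if p > max_p:
            continue                         # criterion (i): P-value cutoff
        back = b_vs_a.get(b, {})
        if not back or min(back, key=back.get) != a:
            continue                         # best hits must be reciprocal
        paralog_ps = [pv for s, pv in a_vs_a.get(a, {}).items() if s != a]
        paralog_ps += [pv for s, pv in b_vs_b.get(b, {}).items() if s != b]
        if any(pv <= p for pv in paralog_ps):
            continue                         # criterion (ii): closer paralog
        pairs.append((a, b))
    return sorted(pairs)


# Hypothetical hit tables with made-up identifiers.
human_vs_fly = {
    "h1": {"f1": 1e-50, "f2": 1e-20},
    "h2": {"f2": 1e-30},
    "h3": {"f3": 1e-5},
    "h4": {"f4": 1e-40},
}
fly_vs_human = {
    "f1": {"h1": 1e-50},
    "f2": {"h2": 1e-30, "h1": 1e-20},
    "f3": {"h3": 1e-5},
    "f4": {"h4": 1e-40},
}
human_vs_human = {"h4": {"h5": 1e-60}}  # h4 has a more similar human paralog
fly_vs_fly = {}

orthologs = strict_orthologs(human_vs_fly, fly_vs_human, human_vs_human, fly_vs_fly)
print(orthologs)
```

In the toy data, h3 fails the P-value cutoff and h4 is rejected because a human paralog is a better match than its fly partner. Rejecting such cases is what makes the resulting count a lower bound on the number of true orthologs.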

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate genomes. Each slice lists the number and percentages (in parentheses) of “strict orthologs” between the human, fly, and worm genomes involved in a given category of molecular function. “Strict orthologs” are defined here as bi-directional BLAST best hits (180) such that each orthologous pair (i) has a BLASTP P-value of ≤10⁻¹⁰ (120), and (ii) has a more significant BLASTP score than any paralogs in either organism, i.e., there has likely been no duplication subsequent to speciation that might make the orthology ambiguous. This measure is quite strict and is a lower bound on the number of orthologs. By these criteria, there are 2758 strict human-fly orthologs, and 2031 human-worm orthologs (1523 in common between these sets).

The distribution of the functions of the conserved protein set is shown in Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the set of conserved proteins is not distributed among molecular functions in the same way as the whole human protein set. Compared with the whole human set (Fig. 15), there are several categories that are overrepresented in the conserved set by a factor of ~2 or more. The first category is nucleic acid enzymes, primarily the transcriptional machinery (notably DNA/RNA methyltransferases, DNA/RNA polymerases, helicases, DNA ligases, DNA- and RNA-processing factors, nucleases, and ribosomal proteins). The basic transcriptional and translational machinery is well known to have been conserved over evolution, from bacteria through to the most complex eukaryotes. Many ribonucleoproteins involved in RNA splicing also appear to be conserved among the animals. Other enzyme types are also overrepresented (transferases, oxidoreductases, ligases, lyases, and isomerases). Many of these enzymes are involved in intermediary metabolism. The only exception is the hydrolase category, which is not significantly overrepresented in the shared protein set. Proteases form the largest part of this category, and several large protease families have expanded in each of these three organisms after their divergence. The category of select regulatory molecules is also overrepresented in the conserved set. The major conserved families are small guanosine triphosphatases (GTPases) (especially the Ras-related superfamily, including ADP ribosylation factor) and cell cycle regulators (particularly the cullin family, cyclin C family, and several cell division protein kinases). The last two significantly overrepresented categories are protein transport and trafficking, and chaperones. The most conserved groups in these categories are proteins involved in coated vesicle-mediated transport, and chaperones involved in protein folding and heat-shock response [particularly the DNAJ family, and heat-shock protein 60 (HSP60), HSP70, and HSP90 families]. These observations provide only a conservative estimate of the protein families in the context of specific cellular processes that were likely derived from the last common ancestor of the human, fly, and worm. As stated before, this analysis does not provide a complete estimate of conservation across the three animal genomes, as paralogous duplication makes the determination of true orthologs difficult within the members of conserved protein families.
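Overrepresentation "by a factor of ~2 or more" can be read as a ratio of two proportions: the share of a functional category within the conserved set divided by its share within the whole protein set. The sketch below uses the two set sizes given in the text (1523 and 26,588), but the per-category counts are invented placeholders, since the actual counts appear only in Figs. 15 and 16. The function name and the threshold handling are likewise assumptions.

```python
def fold_enrichment(in_subset, subset_total, in_background, background_total):
    """Proportion in the subset divided by proportion in the background."""
    if subset_total <= 0 or background_total <= 0 or in_background <= 0:
        raise ValueError("totals and background count must be positive")
    return (in_subset / subset_total) / (in_background / background_total)


CONSERVED_TOTAL = 1_523   # strict orthologs shared with fly and worm (from the text)
WHOLE_SET_TOTAL = 26_588  # predicted human proteins (from the text)

# Hypothetical category counts: (count in conserved set, count in whole set).
hypothetical_counts = {
    "category A": (120, 1_000),
    "category B": (60, 1_050),
}

folds = {
    name: fold_enrichment(k, CONSERVED_TOTAL, n, WHOLE_SET_TOTAL)
    for name, (k, n) in hypothetical_counts.items()
}
overrepresented = sorted(name for name, fold in folds.items() if fold >= 2.0)
print(overrepresented)
```

A ratio like this says nothing about statistical significance on its own; a count-based test would be needed for that, and the text does not say which one, if any, was used.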

7.3 Differences between the human genome and other sequenced eukaryotic genomes

To explore the molecular building blocks of the vertebrate taxon, we have compared the human genome with the other sequenced eukaryotic genomes at three levels: molecular functions, protein families, and protein domains.

Molecular differences can be correlated with phenotypic differences to begin to reveal the developmental and cellular processes that are unique to the vertebrates. Tables 18 and 19 display a comparison among all sequenced eukaryotic genomes, over selected protein/domain families (defined by sequence similarity, e.g., the serine-threonine protein kinases) and superfamilies (defined by shared molecular function, which may include several sequence-related families, e.g., the cytokines). In these tables we have focused on (super) families that are either very large or that differ significantly in humans compared with the other sequenced eukaryote genomes. We have found that the most prominent human expansions are in proteins involved in (i) acquired immune functions; (ii) neural development, structure, and functions; (iii) intercellular and intracellular signaling pathways in development and homeostasis; (iv) hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences between the human genome and the Drosophila or C. elegans genome is the appearance of genes involved in acquired immunity (Tables 18 and 19). This is expected, because the acquired immune response is a defense system that only occurs in vertebrates. We observe 22 class I and 22 class II major histocompatibility complex (MHC) antigen genes and 114 other immunoglobulin genes in the human genome. In addition, there are 59 genes in the cognate immunoglobulin receptor family. At the domain level, this is exemplified by an expansion and recruitment of the ancient immunoglobulin fold to constitute molecules such as MHC, and of the integrin fold to form several of the cell adhesion molecules that mediate interactions between immune effector cells and the extracellular matrix. Vertebrate-specific proteins include the paracrine immune regulators family of secreted 4-alpha helical bundle proteins, namely the cytokines and chemokines. Some of the cytoplasmic signal transduction components associated with cytokine receptor signal transduction are also features that are poorly represented in the fly and worm. These include protein domains found in the signal transducers and activators of transcription (STATs), the suppressors of cytokine signaling (SOCS), and protein inhibitors of activated STATs (PIAS). In contrast, many of the animal-specific protein domains that play a role in innate immune response, such as the Toll receptors, do not appear to be significantly expanded in the human genome.

Neural development, structure, and function. In the human genome, as compared with the worm and fly genomes, there is a marked increase in the number of members of protein families that are involved in neural development. Examples include neurotrophic factors such as ependymin, nerve growth factor, and signaling molecules such as semaphorins, as well as the number of proteins involved directly in neural structure and function such as myelin proteins, voltage-gated ion channels, and synaptic proteins such as synaptotagmin. These observations correlate well with the known phenotypic differences between the nervous systems of these taxa, notably (i) the increase in the number and connectivity of neurons; (ii) the increase in number of distinct neural cell types (as many as a thousand or more in human compared with a few hundred in fly and worm) (121); (iii) the increased length of individual axons; and (iv) the significant increase in glial cell number, especially the appearance of myelinating glial cells, which are electrically inert supporting cells differentiated from the same stem cells as neurons. A number of prominent protein expansions are involved in the processes of neural development. Of the extracellular domains that mediate cell adhesion, the connexin domain–containing proteins (122) exist only in humans. These proteins, which are not present in the Drosophila or C. elegans genomes, appear to provide the constitutive subunits of intercellular channels and the structural basis for electrical coupling. Pathway finding by axons and neuronal network formation is mediated through a subset of ephrins and their cognate receptor tyrosine kinases that act as positional labels to establish topographical projections (123). The probable biological role for the semaphorins (22 in human compared with 6 in the fly and 2 in the worm) and their receptors (neuropilins and plexins) is that of axonal guidance molecules (124). Signaling molecules such as neurotrophic factors and some cytokines have been shown to regulate neuronal cell survival, proliferation, and axon guidance (125). Notch receptors and ligands play important roles in glial cell fate determination and gliogenesis (126).

Other human expanded gene families play key roles directly in neural structure and function. One example is synaptotagmin (expanded more than twofold in humans relative to the invertebrates), originally found to regulate synaptic transmission by serving as a Ca2+ sensor (or receptor) during synaptic vesicle fusion and release (127). Of interest is the increased co-occurrence in humans of PDZ and SH3 domains in neuronal-specific adaptor molecules; examples include proteins that likely modulate channel activity at synaptic junctions (128). We also noted expansions in several ion-channel families (Table 19), including the EAG subfamily (related to cyclic nucleotide gated channels), the voltage-gated calcium/sodium channel family, the inward-rectifier potassium channel family, and the voltage-gated potassium channel, alpha subunit family. Voltage-gated sodium and potassium channels are involved in the generation of action potentials in neurons. Together with voltage-gated calcium channels, they also play a key role in coupling action potentials to neurotransmitter release, in the development of neurites, and in short-term memory. The recent observation of a calcium-regulated association between sodium channels and synaptotagmin may have consequences for the establishment and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes of protein components in both the central and peripheral nervous system of vertebrates. Myelin P0 is a major component of peripheral myelin, and myelin proteolipid and myelin oligodendrocyte glycoprotein are found in the central nervous system. Mutations in any of these


Table 18. Domain-based comparative analysis of proteins in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The predicted protein set of each of the above eukaryotic organisms was analyzed with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins containing the specified Pfam domains as well as the total number of domains (in parentheses) are shown in each column. Domains were categorized into cellular processes for presentation. Some domains (e.g., SH2) are listed in more than one cellular process. Results of the Pfam analysis may differ from results obtained based on human curation of protein families, owing to the limitations of large-scale automatic classifications. Representative examples of domains with reduced counts owing to the stringent E value cutoff used for this analysis are marked with a double asterisk (**). Examples include short divergent and predominantly alpha-helical domains, and certain classes of cysteine-rich zinc finger proteins.

Accession number | Domain name | Domain description | H | F | W | Y | A

Developmental and homeostatic regulators
PF02039 | Adrenomedullin | Adrenomedullin | 1 | 0 | 0 | 0 | 0
PF00212 | ANP | Atrial natriuretic peptide | 2 | 0 | 0 | 0 | 0
PF00028 | Cadherin | Cadherin domain | 100 (550) | 14 (157) | 16 (66) | 0 | 0
PF00214 | Calc_CGRP_IAPP | Calcitonin/CGRP/IAPP family | 3 | 0 | 0 | 0 | 0
PF01110 | CNTF | Ciliary neurotrophic factor | 1 | 0 | 0 | 0 | 0
PF01093 | Clusterin | Clusterin | 3 | 0 | 0 | 0 | 0
PF00029 | Connexin | Connexin | 14 (16) | 0 | 0 | 0 | 0
PF00976 | ACTH_domain | Corticotropin ACTH domain | 1 | 0 | 0 | 0 | 0
PF00473 | CRF | Corticotropin-releasing factor family | 2 | 1 | 0 | 0 | 0
PF00007 | Cys_knot | Cystine-knot domain | 10 (11) | 2 | 0 | 0 | 0
PF00778 | DIX | Dix domain | 5 | 2 | 4 | 0 | 0
PF00322 | Endothelin | Endothelin family | 3 | 0 | 0 | 0 | 0
PF00812 | Ephrin | Ephrin | 7 (8) | 2 | 4 | 0 | 0
PF01404 | EPh_Ibd | Ephrin receptor ligand binding domain | 12 | 2 | 1 | 0 | 0
PF00167 | FGF | Fibroblast growth factor | 23 | 1 | 1 | 0 | 0
PF01534 | Frizzled | Frizzled/Smoothened family membrane region | 9 | 7 | 3 | 0 | 0
PF00236 | Hormone6 | Glycoprotein hormones | 1 | 0 | 0 | 0 | 0
PF01153 | Glypican | Glypican | 14 | 2 | 1 | 0 | 0
PF01271 | Granin | Granin (chromogranin or secretogranin) | 3 | 0 | 0 | 0 | 0
PF02058 | Guanylin | Guanylin precursor | 1 | 0 | 0 | 0 | 0
PF00049 | Insulin | Insulin/IGF/Relaxin family | 7 | 4 | 0 | 0 | 0
PF00219 | IGFBP | Insulin-like growth factor binding proteins | 10 | 0 | 0 | 0 | 0
PF02024 | Leptin | Leptin | 1 | 0 | 0 | 0 | 0
PF00193 | Xlink | LINK (hyaluron binding) | 13 (23) | 0 | 1 | 0 | 0
PF00243 | NGF | Nerve growth factor family | 3 | 0 | 0 | 0 | 0
PF02158 | Neuregulin | Neuregulin family | 4 | 0 | 0 | 0 | 0
PF00184 | Hormone5 | Neurohypophysial hormones | 1 | 0 | 0 | 0 | 0
PF02070 | NMU | Neuromedin U | 1 | 0 | 0 | 0 | 0
PF00066 | Notch | Notch (DSL) domain | 3 (5) | 2 (4) | 2 (6) | 0 | 0
PF00865 | Osteopontin | Osteopontin | 1 | 0 | 0 | 0 | 0
PF00159 | Hormone3 | Pancreatic hormone peptides | 3 | 0 | 0 | 0 | 0
PF01279 | Parathyroid | Parathyroid hormone family | 2 | 0 | 0 | 0 | 0
PF00123 | Hormone2 | Peptide hormone | 5 (9) | 0 | 0 | 0 | 0
PF00341 | PDGF | Platelet-derived growth factor (PDGF) | 5 | 1 | 0 | 0 | 0
PF01403 | Sema | Sema domain | 27 (29) | 8 (10) | 3 (4) | 0 | 0
PF01033 | Somatomedin_B | Somatomedin B domain | 5 (8) | 3 | 0 | 0 | 0
PF00103 | Hormone | Somatotropin | 1 | 0 | 0 | 0 | 0
PF02208 | Sorb | Sorbin homologous domain | 2 | 0 | 0 | 0 | 0
PF02404 | SCF | Stem cell factor | 2 | 0 | 0 | 0 | 0
PF01034 | Syndecan | Syndecan domain | 3 | 1 | 1 | 0 | 0
PF00020 | TNFR_c6 | TNFR/NGFR cysteine-rich region | 17 (31) | 1 | 0 | 0 | 0
PF00019 | TGF-β | Transforming growth factor β-like domain | 27 (28) | 6 | 4 | 0 | 0
PF01099 | Uteroglobin | Uteroglobin family | 3 | 0 | 0 | 0 | 0
PF01160 | Opiods_neuropep | Vertebrate endogenous opioids neuropeptide | 3 | 0 | 0 | 0 | 0
PF00110 | Wnt | Wnt family of developmental signaling proteins | 18 | 7 (10) | 5 | 0 | 0

Hemostasis
PF01821 | ANATO | Anaphylotoxin-like domain | 6 (14) | 0 | 0 | 0 | 0
PF00386 | C1q | C1q domain | 24 | 0 | 0 | 0 | 0
PF00200 | Disintegrin | Disintegrin | 18 | 2 | 3 | 0 | 0
PF00754 | F5_F8_type_C | F5/8 type C domain | 15 (20) | 5 (6) | 2 | 0 | 0
PF01410 | COLFI | Fibrillar collagen C-terminal domain | 10 | 0 | 0 | 0 | 0
PF00039 | Fn1 | Fibronectin type I domain | 5 (18) | 0 | 0 | 0 | 0
PF00040 | Fn2 | Fibronectin type II domain | 11 (16) | 0 | 0 | 0 | 0
PF00051 | Kringle | Kringle domain | 15 (24) | 2 | 2 | 0 | 0
PF01823 | MACPF | MAC/Perforin domain | 6 | 0 | 0 | 0 | 0
PF00354 | Pentaxin | Pentaxin family | 9 | 0 | 0 | 0 | 0
PF00277 | SAA_proteins | Serum amyloid A protein | 4 | 0 | 0 | 0 | 0
PF00084 | Sushi | Sushi domain (SCR repeat) | 53 (191) | 11 (42) | 8 (45) | 0 | 0
PF02210 | TSPN | Thrombospondin N-terminal-like domains | 14 | 1 | 0 | 0 | 0
PF01108 | Tissue_fac | Tissue factor | 1 | 0 | 0 | 0 | 0
PF00868 | Transglutamin_N | Transglutaminase family | 6 | 1 | 0 | 0 | 0
PF00927 | Transglutamin_C | Transglutaminase family | 8 | 1 | 0 | 0 | 0


Table 18 (Continued)

Accession number | Domain name | Domain description | H | F | W | Y | A

PF00594 | Gla | Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain | 11 | 0 | 0 | 0 | 0

Immune response
PF00711 | Defensin_beta | Beta defensin | 1 | 0 | 0 | 0 | 0
PF00748 | Calpain_inhib | Calpain inhibitor repeat | 3 (9) | 0 | 0 | 0 | 0
PF00666 | Cathelicidins | Cathelicidins | 2 | 0 | 0 | 0 | 0
PF00129 | MHC_I | Class I histocompatibility antigen, domains alpha 1 and 2 | 18 (20) | 0 | 0 | 0 | 0
PF00993 | MHC_II_alpha** | Class II histocompatibility antigen, alpha domain | 5 (6) | 0 | 0 | 0 | 0
PF00969 | MHC_II_beta** | Class II histocompatibility antigen, beta domain | 7 | 0 | 0 | 0 | 0
PF00879 | Defensin_propep | Defensin propeptide | 3 | 0 | 0 | 0 | 0
PF01109 | GM_CSF | Granulocyte-macrophage colony-stimulating factor | 1 | 0 | 0 | 0 | 0
PF00047 | Ig | Immunoglobulin domain | 381 (930) | 125 (291) | 67 (323) | 0 | 0
PF00143 | Interferon | Interferon alpha/beta domain | 7 (9) | 0 | 0 | 0 | 0
PF00714 | IFN-gamma | Interferon gamma | 1 | 0 | 0 | 0 | 0
PF00726 | IL10 | Interleukin-10 | 1 | 0 | 0 | 0 | 0
PF02372 | IL15 | Interleukin-15 | 1 | 0 | 0 | 0 | 0
PF00715 | IL2 | Interleukin-2 | 1 | 0 | 0 | 0 | 0
PF00727 | IL4 | Interleukin-4 | 1 | 0 | 0 | 0 | 0
PF02025 | IL5 | Interleukin-5 | 1 | 0 | 0 | 0 | 0
PF01415 | IL7 | Interleukin-7/9 family | 1 | 0 | 0 | 0 | 0
PF00340 | IL1 | Interleukin-1 | 7 | 0 | 0 | 0 | 0
PF02394 | IL1_propep | Interleukin-1 propeptide | 1 | 0 | 0 | 0 | 0
PF02059 | IL3 | Interleukin-3 | 1 | 0 | 0 | 0 | 0
PF00489 | IL6 | Interleukin-6/G-CSF/MGF family | 2 | 0 | 0 | 0 | 0
PF01291 | LIF_OSM | Leukemia inhibitory factor (LIF)/oncostatin (OSM) family | 2 | 0 | 0 | 0 | 0
PF00323 | Defensins | Mammalian defensin | 2 | 0 | 0 | 0 | 0
PF01091 | PTN_MK | PTN/MK heparin-binding protein | 2 | 0 | 0 | 0 | 0
PF00277 | SAA_proteins | Serum amyloid A protein | 4 | 0 | 0 | 0 | 0
PF00048 | IL8 | Small cytokines (intecrine/chemokine), interleukin-8 like | 32 | 0 | 0 | 0 | 0
PF01582 | TIR | TIR domain | 18 | 8 | 2 | 0 | 131 (143)
PF00229 | TNF | TNF (tumor necrosis factor) family | 12 | 0 | 0 | 0 | 0
PF00088 | Trefoil | Trefoil (P-type) domain | 5 (6) | 0 | 2 | 0 | 0

PI-PY-rho GTPase signaling
PF00779 | BTK | BTK motif | 5 | 1 | 0 | 0 | 0
PF00168 | C2 | C2 domain | 73 (101) | 32 (44) | 24 (35) | 6 (9) | 66 (90)
PF00609 | DAGKa | Diacylglycerol kinase accessory domain (presumed) | 9 | 4 | 7 | 0 | 6
PF00781 | DAGKc | Diacylglycerol kinase catalytic domain (presumed) | 10 | 8 | 8 | 2 | 11 (12)
PF00610 | DEP | Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP) | 12 (13) | 4 | 10 | 5 | 2
PF01363 | FYVE | FYVE zinc finger | 28 (30) | 14 | 15 | 5 | 15
PF00996 | GDI | GDP dissociation inhibitor | 6 | 2 | 1 | 1 | 3
PF00503 | G-alpha | G-protein alpha subunit | 27 (30) | 10 | 20 (23) | 2 | 5
PF00631 | G-gamma | G-protein gamma like domains | 16 | 5 | 5 | 1 | 0
PF00616 | RasGAP | GTPase-activator protein for Ras-like GTPase | 11 | 5 | 8 | 3 | 0
PF00618 | RasGEFN | Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif | 9 | 2 | 3 | 5 | 0
PF00625 | Guanylate_kin | Guanylate kinase | 12 | 8 | 7 | 1 | 4
PF02189 | ITAM | Immunoreceptor tyrosine-based activation motif | 3 | 0 | 0 | 0 | 0
PF00169 | PH | PH domain | 193 (212) | 72 (78) | 65 (68) | 24 | 23
PF00130 | DAG_PE-bind | Phorbol esters/diacylglycerol binding domain (C1 domain) | 45 (56) | 25 (31) | 26 (40) | 1 (2) | 4
PF00388 | PI-PLC-X | Phosphatidylinositol-specific phospholipase C, X domain | 12 | 3 | 7 | 1 | 8
PF00387 | PI-PLC-Y | Phosphatidylinositol-specific phospholipase C, Y domain | 11 | 2 | 7 | 1 | 8
PF00640 | PID | Phosphotyrosine interaction domain (PTB/PID) | 24 (27) | 13 | 11 (12) | 0 | 0
PF02192 | PI3K_p85B | PI3-kinase family, p85-binding domain | 2 | 1 | 1 | 0 | 0
PF00794 | PI3K_rbd | PI3-kinase family, ras-binding domain | 6 | 3 | 1 | 0 | 0
PF01412 | ArfGAP | Putative GTP-ase activating protein for Arf | 16 | 9 | 8 | 6 | 15
PF02196 | RBD | Raf-like Ras-binding domain | 6 (7) | 4 | 1 | 0 | 0
PF02145 | Rap_GAP | Rap/ran-GAP | 5 | 4 | 2 | 0 | 0
PF00788 | RA | Ras association (RalGDS/AF-6) domain | 18 (19) | 7 (9) | 6 | 1 | 0
PF00071 | Ras | Ras family | 126 | 56 (57) | 51 | 23 | 78
PF00617 | RasGEF | RasGEF domain | 21 | 8 | 7 | 5 | 0
PF00615 | RGS | Regulator of G protein signaling domain | 27 | 6 (7) | 12 (13) | 1 | 0
PF02197 | RIIa | Regulatory subunit of type II PKA R-subunit | 4 | 1 | 2 | 1 | 0


Table 18 (Continued)

Accession number | Domain name | Domain description | H | F | W | Y | A

PF00620 | RhoGAP | RhoGAP domain | 59 | 19 | 20 | 9 | 8
PF00621 | RhoGEF | RhoGEF domain | 46 | 23 (24) | 18 (19) | 3 | 0
PF00536 | SAM | SAM domain (Sterile alpha motif) | 29 (31) | 15 | 8 | 3 | 6
PF01369 | Sec7 | Sec7 domain | 13 | 5 | 5 | 5 | 9
PF00017 | SH2 | Src homology 2 (SH2) domain | 87 (95) | 33 (39) | 44 (48) | 1 | 3
PF00018 | SH3 | Src homology 3 (SH3) domain | 143 (182) | 55 (75) | 46 (61) | 23 (27) | 4
PF01017 | STAT | STAT protein | 7 | 1 | 1 (2) | 0 | 0
PF00790 | VHS | VHS domain | 4 | 2 | 4 | 4 | 8
PF00568 | WH1 | WH1 domain | 7 | 2 | 2 (3) | 1 | 0

Domains involved in apoptosis
PF00452 | Bcl-2 | Bcl-2 | 9 | 2 | 1 | 0 | 0
PF02180 | BH4 | Bcl-2 homology region 4 | 3 | 0 | 1 | 0 | 0
PF00619 | CARD | Caspase recruitment domain | 16 | 0 | 2 | 0 | 0
PF00531 | Death | Death domain | 16 | 5 | 7 | 0 | 0
PF01335 | DED | Death effector domain | 4 (5) | 0 | 0 | 0 | 0
PF02179 | BAG | Domain present in Hsp70 regulators | 5 (8) | 3 | 2 | 1 | 5
PF00656 | ICE_p20 | ICE-like protease (caspase) p20 domain | 11 | 7 | 3 | 0 | 0
PF00653 | BIR | Inhibitor of Apoptosis domain | 8 (14) | 5 (9) | 2 (3) | 1 (2) | 0

Cytoskeletal
PF00022 | Actin | Actin | 61 (64) | 15 (16) | 12 | 9 (11) | 24
PF00191 | Annexin | Annexin | 16 (55) | 4 (16) | 4 (11) | 0 | 6 (16)
PF00402 | Calponin | Calponin family | 13 (22) | 3 | 7 (19) | 0 | 0
PF00373 | Band_41 | FERM domain (Band 4.1 family) | 29 (30) | 17 (19) | 11 (14) | 0 | 0
PF00880 | Nebulin_repeat | Nebulin repeat | 4 (148) | 1 (2) | 1 | 0 | 0
PF00681 | Plectin_repeat | Plectin repeat | 2 (11) | 0 | 0 | 0 | 0
PF00435 | Spectrin | Spectrin repeat | 31 (195) | 13 (171) | 10 (93) | 0 | 0
PF00418 | Tubulin-binding | Tau and MAP proteins, tubulin-binding | 4 (12) | 1 (4) | 2 (8) | 0 | 0
PF00992 | Troponin | Troponin | 4 | 6 | 8 | 0 | 0
PF02209 | VHP | Villin headpiece domain | 5 | 2 | 2 | 0 | 5
PF01044 | Vinculin | Vinculin family | 4 | 2 | 1 | 0 | 0

ECM adhesion
PF01391 | Collagen | Collagen triple helix repeat (20 copies) | 65 (279) | 10 (46) | 174 (384) | 0 | 0
PF01413 | C4 | C-terminal tandem repeated domain in type 4 procollagen | 6 (11) | 2 (4) | 3 (6) | 0 | 0
PF00431 | CUB | CUB domain | 47 (69) | 9 (47) | 43 (67) | 0 | 0
PF00008 | EGF | EGF-like domain | 108 (420) | 45 (186) | 54 (157) | 0 | 1
PF00147 | Fibrinogen_C | Fibrinogen beta and gamma chains, C-terminal globular domain | 26 | 10 (11) | 6 | 0 | 0
PF00041 | Fn3 | Fibronectin type III domain | 106 (545) | 42 (168) | 34 (156) | 0 | 1
PF00757 | Furin-like | Furin-like cysteine rich region | 5 | 2 | 1 | 0 | 0
PF00357 | Integrin_A | Integrin alpha cytoplasmic region | 3 | 1 | 2 | 0 | 0
PF00362 | Integrin_B | Integrins, beta chain | 8 | 2 | 2 | 0 | 0
PF00052 | Laminin_B | Laminin B (Domain IV) | 8 (12) | 4 (7) | 6 (10) | 0 | 0
PF00053 | Laminin_EGF | Laminin EGF-like (Domains III and V) | 24 (126) | 9 (62) | 11 (65) | 0 | 0
PF00054 | Laminin_G | Laminin G domain | 30 (57) | 18 (42) | 14 (26) | 0 | 0
PF00055 | Laminin_Nterm | Laminin N-terminal (Domain VI) | 10 | 6 | 4 | 0 | 0
PF00059 | Lectin_c | Lectin C-type domain | 47 (76) | 23 (24) | 91 (132) | 0 | 0
PF01463 | LRRCT | Leucine rich repeat C-terminal domain | 69 (81) | 23 (30) | 7 (9) | 0 | 0
PF01462 | LRRNT | Leucine rich repeat N-terminal domain | 40 (44) | 7 (13) | 3 (6) | 0 | 0
PF00057 | Ldl_recept_a | Low-density lipoprotein receptor domain class A | 35 (127) | 33 (152) | 27 (113) | 0 | 0
PF00058 | Ldl_recept_b | Low-density lipoprotein receptor repeat class B | 15 (96) | 9 (56) | 7 (22) | 0 | 0
PF00530 | SRCR | Scavenger receptor cysteine-rich domain | 11 (46) | 4 (8) | 1 (2) | 0 | 0
PF00084 | Sushi | Sushi domain (SCR repeat) | 53 (191) | 11 (42) | 8 (45) | 0 | 0
PF00090 | Tsp_1 | Thrombospondin type 1 domain | 41 (66) | 11 (23) | 18 (47) | 0 | 0
PF00092 | Vwa | von Willebrand factor type A domain | 34 (58) | 0 | 17 (19) | 0 | 1
PF00093 | Vwc | von Willebrand factor type C domain | 19 (28) | 6 (11) | 2 (5) | 0 | 0
PF00094 | Vwd | von Willebrand factor type D domain | 15 (35) | 3 (7) | 9 | 0 | 0

Protein interaction domains
PF00244 | 14-3-3 | 14-3-3 proteins | 20 | 3 | 3 | 2 | 15
PF00023 | Ank | Ank repeat | 145 (404) | 72 (269) | 75 (223) | 12 (20) | 66 (111)
PF00514 | Armadillo_seg | Armadillo/beta-catenin-like repeats | 22 (56) | 11 (38) | 3 (11) | 2 (10) | 25 (67)
PF00168 | C2 | C2 domain | 73 (101) | 32 (44) | 24 (35) | 6 (9) | 66 (90)
PF00027 | cNMP_binding | Cyclic nucleotide-binding domain | 26 (31) | 21 (33) | 15 (20) | 2 (3) | 22
PF01556 | DnaJ_C | DnaJ C terminal region | 12 | 9 | 5 | 3 | 19
PF00226 | DnaJ | DnaJ domain | 44 | 34 | 33 | 20 | 93
PF00036 | Efhand** | EF hand | 83 (151) | 64 (117) | 41 (86) | 4 (11) | 120 (328)
PF00611 | FCH | Fes/CIP4 homology domain | 9 | 3 | 2 | 4 | 0
PF01846 | FF | FF domain | 4 (11) | 4 (10) | 3 (16) | 2 (5) | 4 (8)
PF00498 | FHA | FHA domain | 13 | 15 | 7 | 13 (14) | 17


myelin proteins result in severe demyelination, which is a pathological condition in which the myelin is lost and the nerve conduction is severely impaired (130). Humans have at least 10 genes belonging to four different families involved in myelin production (five myelin P0, three myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte glycoprotein, or MOG), and possibly more-remotely related members of the MOG family. Flies have only a single myelin proteolipid, and worms have none at all.

Intercellular and intracellular signaling pathways in development and homeostasis. Many protein families that have expanded in humans relative to the invertebrates are involved in signaling processes, particularly in response to development and differentiation

Table 18 (Continued)

Accession number | Domain name | Domain description | H | F | W | Y | A

PF00254 | FKBP | FKBP-type peptidyl-prolyl cis-trans isomerases | 15 (20) | 7 (8) | 7 (13) | 4 | 24 (29)
PF01590 | GAF | GAF domain | 7 (8) | 2 (4) | 1 | 0 | 10
PF01344 | Kelch | Kelch motif | 54 (157) | 12 (48) | 13 (41) | 3 | 102 (178)
PF00560 | LRR** | Leucine Rich Repeat | 25 (30) | 24 (30) | 7 (11) | 1 | 15 (16)
PF00917 | MATH | MATH domain | 11 | 5 | 88 (161) | 1 | 61 (74)
PF00989 | PAS | PAS domain | 18 (19) | 9 (10) | 6 | 1 | 13 (18)
PF00595 | PDZ | PDZ domain (Also known as DHR or GLGF) | 96 (154) | 60 (87) | 46 (66) | 2 | 5
PF00169 | PH | PH domain | 193 (212) | 72 (78) | 65 (68) | 24 | 23
PF01535 | PPR** | PPR repeat | 5 | 3 (4) | 0 | 1 | 474 (2485)
PF00536 | SAM | SAM domain (Sterile alpha motif) | 29 (31) | 15 | 8 | 3 | 6
PF01369 | Sec7 | Sec7 domain | 13 | 5 | 5 | 5 | 9
PF00017 | SH2 | Src homology 2 (SH2) domain | 87 (95) | 33 (39) | 44 (48) | 1 | 3
PF00018 | SH3 | Src homology 3 (SH3) domain | 143 (182) | 55 (75) | 46 (61) | 23 (27) | 4
PF01740 | STAS | STAS domain | 5 | 1 | 6 | 2 | 13
PF00515 | TPR** | TPR domain | 72 (131) | 39 (101) | 28 (54) | 16 (31) | 65 (124)
PF00400 | WD40** | WD40 domain | 136 (305) | 98 (226) | 72 (153) | 56 (121) | 167 (344)
PF00397 | WW | WW domain | 32 (53) | 24 (39) | 16 (24) | 5 (8) | 11 (15)
PF00569 | ZZ | ZZ-Zinc finger present in dystrophin, CBP/p300 | 10 (11) | 13 | 10 | 2 | 10

Nuclear interaction domains
PF01754 | Zf-A20 | A20-like zinc finger | 2 (8) | 2 | 2 | 0 | 8
PF01388 | ARID | ARID DNA binding domain | 11 | 6 | 4 | 2 | 7
PF01426 | BAH | BAH domain | 8 (10) | 7 (8) | 4 (5) | 5 | 21 (25)
PF00643 | Zf-B_box** | B-box zinc finger | 32 (35) | 1 | 2 | 0 | 0
PF00533 | BRCT | BRCA1 C Terminus (BRCT) domain | 17 (28) | 10 (18) | 23 (35) | 10 (16) | 12 (16)
PF00439 | Bromodomain | Bromodomain | 37 (48) | 16 (22) | 18 (26) | 10 (15) | 28
PF00651 | BTB | BTB/POZ domain | 97 (98) | 62 (64) | 86 (91) | 1 (2) | 30 (31)
PF00145 | DNA_methylase | C-5 cytosine-specific DNA methylase | 3 (4) | 1 | 0 | 0 | 13 (15)
PF00385 | Chromo | 'chromo' (CHRromatin Organization MOdifier) domain | 24 (27) | 14 (15) | 17 (18) | 1 (2) | 12
PF00125 | Histone | Core histone H2A/H2B/H3/H4 | 75 (81) | 5 | 71 (73) | 8 | 48
PF00134 | Cyclin | Cyclin | 19 | 10 | 10 | 11 | 35
PF00270 | DEAD | DEAD/DEAH box helicase | 63 (66) | 48 (50) | 55 (57) | 50 (52) | 84 (87)
PF01529 | Zf-DHHC | DHHC zinc finger domain | 15 | 20 | 16 | 7 | 22
PF00646 | F-box** | F-box domain | 16 | 15 | 309 (324) | 9 | 165 (167)
PF00250 | Fork_head | Fork head domain | 35 (36) | 20 (21) | 15 | 4 | 0
PF00320 | GATA | GATA zinc finger | 11 (17) | 5 (6) | 8 (10) | 9 | 26
PF01585 | G-patch | G-patch domain | 18 | 16 | 13 | 4 | 14 (15)
PF00010 | HLH** | Helix-loop-helix DNA-binding domain | 60 (61) | 44 | 24 | 4 | 39
PF00850 | Hist_deacetyl | Histone deacetylase family | 12 | 5 (6) | 8 (10) | 5 | 10
PF00046 | Homeobox | Homeobox domain | 160 (178) | 100 (103) | 82 (84) | 6 | 66
PF01833 | TIG | IPT/TIG domain | 29 (53) | 11 (13) | 5 (7) | 2 | 1
PF02373 | JmjC | JmjC domain | 10 | 4 | 6 | 4 | 7
PF02375 | JmjN | JmjN domain | 7 | 4 | 2 | 3 | 7
PF00013 | KH-domain | KH domain | 28 (67) | 14 (32) | 17 (46) | 4 (14) | 27 (61)
PF01352 | KRAB | KRAB box | 204 (243) | 0 | 0 | 0 | 0
PF00104 | Hormone_rec | Ligand-binding domain of nuclear hormone receptor | 47 | 17 | 142 (147) | 0 | 0
PF00412 | LIM | LIM domain containing proteins | 62 (129) | 33 (83) | 33 (79) | 4 (7) | 10 (16)
PF00917 | MATH | MATH domain | 11 | 5 | 88 (161) | 1 | 61 (74)
PF00249 | Myb_DNA-binding | Myb-like DNA-binding domain | 32 (43) | 18 (24) | 17 (24) | 15 (20) | 243 (401)
PF02344 | Myc-LZ | Myc leucine zipper domain | 1 | 0 | 0 | 0 | 0
PF01753 | Zf-MYND | MYND finger | 14 | 14 | 9 | 1 | 7
PF00628 | PHD | PHD-finger | 68 (86) | 40 (53) | 32 (44) | 14 (15) | 96 (105)
PF00157 | Pou | Pou domain - N-terminal to homeobox domain | 15 | 5 | 4 | 0 | 0
PF02257 | RFX_DNA_binding | RFX DNA-binding domain | 7 | 2 | 1 | 1 | 0
PF00076 | Rrm | RNA recognition motif (a.k.a. RRM, RBD, or RNP domain) | 224 (324) | 127 (199) | 94 (145) | 43 (73) | 232 (369)
PF02037 | SAP | SAP domain | 15 | 8 | 5 | 5 | 6 (7)
PF00622 | SPRY | SPRY domain | 44 (51) | 10 (12) | 5 (7) | 3 | 6
PF01852 | START | START domain | 10 | 2 | 6 | 0 | 23
PF00907 | T-box | T-box | 17 (19) | 8 | 22 | 0 | 0


(Tables 18 and 19). They include secreted hormones and growth factors, receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome include growth factors such as wnt, transforming growth factor-β (TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet-derived growth factor (PDGF), and ephrins. These growth factors affect tissue differentiation and a wide range of cellular processes involving actin-cytoskeletal and nuclear regulation. The corresponding receptors of these developmental ligands are also expanded in humans. For example, our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho family of transcriptional corepressors downstream in the wnt pathway is even more markedly expanded, with 13 predicted members in humans (2 in the fly, 1 in the worm).

Extracellular adhesion molecules involved in signaling are expanded in the human genome (Tables 18 and 19). The interactions of several of these adhesion domains with extracellular matrix proteoglycans play a critical role in host defense, morphogenesis, and tissue repair (131). Consistent with the well-defined role of heparan sulfate proteoglycans in modulating these interactions (132), we observe an expansion of the heparan sulfate sulfotransferases in the human genome relative to worm and fly. These sulfotransferases modulate tissue differentiation (133). A similar expansion in humans is noted in structural proteins that constitute the actin-cytoskeletal architecture. Compared with the fly and worm, we observe an explosive expansion of the nebulin (35 domains per protein on average), aggrecan (12 domains per protein on average), and plectin (5 domains per protein on average) repeats in humans. These repeats are present in proteins involved in modulating the actin cytoskeleton, with predominant expression in neuronal, muscle, and vascular tissues.

Comparison across the five sequenced eukaryotic organisms revealed several expanded protein families and domains involved in cytoplasmic signal transduction (Table 18). In particular, signal transduction pathways playing roles in developmental regulation and acquired immunity were substantially enriched. There is a factor of 2 or greater expansion in humans in the Ras superfamily GTPases and the GTPase activator and GTP exchange factors associated with them. Although there are about the same number of tyrosine kinases in the human and C. elegans genomes, in humans there is an increase in the SH2, PTB, and ITAM domains involved in phosphotyrosine signal transduction. Further, there is a twofold expansion of phosphodiesterases in the human genome compared with either the worm or fly genomes.
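As a rough illustration of how such expansion factors can be read off the table, the sketch below divides the human protein count by the fly and worm counts for a few Ras-related Pfam rows copied from Table 18. The function name `fold_expansion` and the choice of rows are assumptions made for this example. Pfam domain counts are only one way to measure family size, so individual ratios need not match the factor quoted in the text.

```python
# Protein counts copied from Table 18 (H = human, F = fly, W = worm).
# The selection of rows is illustrative, not the authors' own analysis.
table18_counts = {
    "Ras (PF00071)": {"H": 126, "F": 56, "W": 51},
    "RasGAP (PF00616)": {"H": 11, "F": 5, "W": 8},
    "RasGEF (PF00617)": {"H": 21, "F": 8, "W": 7},
}


def fold_expansion(counts, reference="H"):
    """Return reference_count / other_count for every other organism.

    Organisms with a zero count are skipped to avoid division by zero.
    """
    ref = counts[reference]
    return {
        org: ref / n
        for org, n in counts.items()
        if org != reference and n > 0
    }


for family, counts in table18_counts.items():
    folds = fold_expansion(counts)
    summary = ", ".join(f"H/{org} = {x:.2f}" for org, x in folds.items())
    print(f"{family}: {summary}")
```

Skipping zero counts keeps the ratio defined; a family that is absent from the comparison organism is better described as lineage-specific than as expanded.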

The downstream effectors of the intracellular signaling molecules include the transcription factors that transduce developmental fates. Significant expansions are noted in the ligand-binding nuclear hormone receptor class of transcription factors compared with the fly genome, although not to the extent observed in the worm (Tables 18 and 19). Perhaps the most striking expansion in humans is in the C2H2 zinc finger transcription factors. Pfam detects a total of 4500 C2H2 zinc finger domains in 564 human proteins, compared with 771 in 234 fly proteins. This means that there has been a dramatic expansion not only in the number of C2H2 transcription factors, but also in the number of these DNA-binding motifs per transcription factor (8 on average in humans, 3.3 on average in the fly, and 2.3 on average in the worm). Furthermore, many of these transcription factors contain either the KRAB or SCAN domains, which are not found in the fly or worm genomes. These domains are involved in the oligomerization of transcription factors and increase the combinatorial partnering of these factors. In general, most of the transcription factor domains are shared between the three animal genomes, but the reassortment of these domains results in organism-specific transcription factor families. The domain combinations found in the human, fly, and worm include the BTB with C2H2 in the fly and humans, and homeodomains alone or in combination with Pou and LIM domains in all of the animal genomes. In plants, however, a different set of transcription factors are expanded, namely, the myb family, and a unique set that includes VP1 and AP2 domain-containing proteins (134). The yeast genome has a paucity of transcription factors compared with the multicellular eukaryotes, and its repertoire is limited to the expansion of the yeast-specific C6 transcription factor family involved in metabolic regulation.
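The per-protein averages quoted above for C2H2 zinc fingers follow directly from the Zf-C2H2 (PF00096) row of Table 18, where each cell gives the number of proteins and, in parentheses, the total number of domains. A minimal sketch of that arithmetic, with the helper name `domains_per_protein` chosen only for this example:

```python
# (proteins, total domains) for Zf-C2H2 (PF00096), copied from Table 18.
c2h2 = {
    "human": (564, 4500),
    "fly": (234, 771),
    "worm": (68, 155),
}


def domains_per_protein(n_proteins, n_domains):
    """Average number of domain copies per domain-containing protein."""
    return n_domains / n_proteins


for organism, (n_proteins, n_domains) in c2h2.items():
    average = domains_per_protein(n_proteins, n_domains)
    print(f"{organism}: {average:.1f} C2H2 domains per protein")
```

Rounded to one decimal place, the three ratios reproduce the averages given in the text (8, 3.3, and 2.3).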

While we have illustrated expansions in a subset of signal transduction molecules in the human genome compared with the other eukaryotic genomes, it should be noted that most of the protein domains are highly conserved. An interesting observation is that worms and humans have approximately the same number of both tyrosine kinases and serine/threonine kinases (Table 19). It is important to note, however, that these are merely counts of the catalytic domain; the proteins that contain these domains also display a wide repertoire of interaction domains with significant combinatorial diversity.

Hemostasis. Hemostasis is regulated primarily by plasma proteases of the coagulation pathway and by the interactions that occur between the vascular endothelium and platelets. Consistent with known anatomical and physiological differences between vertebrates and invertebrates, extracellular adhesion domains that constitute proteins integral to hemostasis are expanded in the human relative to the fly and worm (Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and C1q that mediate surface interactions between hematopoietic cells and the vascular matrix. In addition, there has been extensive recruitment of more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and FN3 into multidomain proteins that are involved in hemostatic regulation. Although we do not find a large expansion in the total number of serine proteases, this enzymatic domain has been specifically recruited into several of these multidomain proteins for proteolytic regulation in the vascular compartment. These are represented in plasma proteins that belong to the kinin and complement pathways. There is a

Table 18 (Continued)

Accession number | Domain name | Domain description | H | F | W | Y | A

PF02135 | Zf-TAZ | TAZ finger | 2 (3) | 1 (2) | 6 (7) | 0 | 10 (15)
PF01285 | TEA | TEA domain | 4 | 1 | 1 | 1 | 0
PF02176 | Zf-TRAF | TRAF-type zinc finger | 6 (9) | 1 (3) | 1 | 0 | 2
PF00352 | TBP | Transcription factor TFIID (or TATA-binding protein, TBP) | 2 (4) | 4 (8) | 2 (4) | 1 (2) | 2 (4)
PF00567 | TUDOR | TUDOR domain | 9 (24) | 9 (19) | 4 (5) | 0 | 2
PF00642 | Zf-CCCH | Zinc finger C-x8-C-x5-C-x3-H type (and similar) | 17 (22) | 6 (8) | 22 (42) | 3 (5) | 31 (46)
PF00096 | Zf-C2H2** | Zinc finger, C2H2 type | 564 (4500) | 234 (771) | 68 (155) | 34 (56) | 21 (24)
PF00097 | Zf-C3HC4 | Zinc finger, C3HC4 type (RING finger) | 135 (137) | 57 | 88 (89) | 18 | 298 (304)
PF00098 | Zf-CCHC | Zinc knuckle | 9 (17) | 6 (10) | 17 (33) | 7 (13) | 68 (91)
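The counting convention described in the Table 18 caption (proteins containing a domain, with the total number of domain occurrences in parentheses, after an E value cutoff of 0.001) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the input layout, the function names, the inclusive comparison against the cutoff, and the reading that a cell without parentheses means both numbers are equal are all assumptions, and the hits are made up.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # cutoff stated in the Table 18 caption


def count_domains(hits, cutoff=E_VALUE_CUTOFF):
    """Return {accession: (n_proteins, n_domains)} for hits passing the cutoff.

    `hits` is an iterable of (protein_id, pfam_accession, e_value) tuples,
    one tuple per domain occurrence (hypothetical input layout).
    """
    proteins = defaultdict(set)
    domains = defaultdict(int)
    for protein_id, accession, e_value in hits:
        if e_value <= cutoff:  # assumption: the cutoff is inclusive
            proteins[accession].add(protein_id)
            domains[accession] += 1
    return {acc: (len(proteins[acc]), domains[acc]) for acc in domains}


def format_cell(n_proteins, n_domains):
    """Render 'proteins (domains)'; omit the parentheses when both are equal."""
    if n_proteins == n_domains:
        return str(n_proteins)
    return f"{n_proteins} ({n_domains})"


# Made-up hits for illustration only; not data from the paper.
example_hits = [
    ("protA", "PF00029", 1e-30),
    ("protA", "PF00029", 2e-5),   # second copy of the domain in protA
    ("protB", "PF00029", 4e-12),
    ("protC", "PF00029", 0.5),    # fails the cutoff, so it is not counted
]
counts = count_domains(example_hits)
print(format_cell(*counts["PF00029"]))
```

A stricter cutoff removes weak hits such as `protC`, which is the effect the caption flags with a double asterisk for short or divergent domains.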


significant expansion in two families of matrix metalloproteases: ADAM (a disintegrin and metalloprotease) and MMPs (matrix metalloproteases) (Table 19). Proteolysis of extracellular matrix (ECM) proteins is critical for tissue development and for tissue degradation in diseases such as cancer, arthritis, Alzheimer's disease, and a variety of inflammatory conditions (135, 136). ADAMs are a family of integral membrane proteins with a pivotal role in fibrinogenolysis and modulating interactions between hematopoietic components and the vascular matrix components. These proteins have been shown to cleave matrix proteins, and even signaling molecules: ADAM-17 converts tumor necrosis factor-α, and ADAM-10 has been implicated in the Notch signaling pathway (135). We have identified 19 members of the matrix metalloprotease family, and a total of 51 members of the ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway components across eukarya is consistent with its central role in developmental regulation and as a response to pathogens and stress signals. The signal transduction pathways involved in programmed cell death, or apoptosis, are mediated by interactions between well-characterized domains that include extracellular domains, adaptor (protein-protein interaction) domains, and those found in effector and regulatory enzymes (137). We enumerated the protein counts of central adaptor and effector enzyme domains that are found only in the apoptotic pathways to provide an estimate of divergence across eukarya and relative expansion in the human genome when compared with the fly and worm (Table 18). Adaptor domains found in proteins restricted only to apoptotic regulation such as the DED domains are vertebrate-specific, whereas others like BIR, CARD, and Bcl2 are represented in the fly and worm (although the number of Bcl2 family members in humans is significantly expanded). Although plants and yeast lack the caspases, caspase-like molecules, namely the para- and meta-caspases, have been reported in these organisms (138). Compared with other animal genomes, the human genome shows an expansion in the adaptor and effector domain-containing proteins involved in apoptosis, as well as in the proteases involved in the cascade such as the caspase and calpain families.
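The kind of lineage comparison made in this paragraph can be sketched from the apoptosis rows of Table 18 by checking which of the five genomes contain at least one protein with a given domain. The labels returned by `classify` are assumptions made for this example; in particular, "human only" among these five genomes is just a proxy for vertebrate specificity, since no other vertebrate is in the comparison.

```python
ORGANISMS = ("H", "F", "W", "Y", "A")  # human, fly, worm, yeast, Arabidopsis

# Protein counts (first number in each cell) copied from Table 18.
apoptosis_rows = {
    "DED": (4, 0, 0, 0, 0),
    "Bcl-2": (9, 2, 1, 0, 0),
    "ICE_p20": (11, 7, 3, 0, 0),
    "BIR": (8, 5, 2, 1, 0),
    "BAG": (5, 3, 2, 1, 5),
}


def present_in(row):
    """Return the organisms with at least one protein carrying the domain."""
    return tuple(org for org, n in zip(ORGANISMS, row) if n > 0)


def classify(row):
    """Coarse distribution label based only on presence or absence."""
    present = set(present_in(row))
    if not present:
        return "absent"
    if present == {"H"}:
        return "human only"
    if present <= {"H", "F", "W"}:
        return "animals only"
    return "also in yeast and/or plant"


for domain, row in apoptosis_rows.items():
    print(f"{domain}: {classify(row)} ({', '.join(present_in(row))})")
```

Presence or absence at a fixed E value cutoff can miss divergent homologs, so labels derived this way are best treated as a starting point rather than a conclusion.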

Expansions of other protein families. Metabolic enzymes. There are fewer cytochrome P450 genes in humans than in either the fly or worm. Lipoxygenases (six in humans), on the other hand, appear to be specific to the vertebrates and plants, whereas the lipoxygenase-activating proteins (four in humans) may be vertebrate-specific. Lipoxygenases are involved in arachidonic acid metabolism, and they and their activators have been implicated in diverse human pathology ranging from allergic responses to cancers. One of the most surprising human expansions, however, is in the number of glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3 in the fly, and 4 in the worm). There is, however, evidence for many retrotransposed GAPDH pseudogenes (139), which may account for this apparent expansion. However, it is interesting that GAPDH, long known as a conserved enzyme involved in basic metabolism found across all phyla from bacteria to humans, has recently been shown to have other functions. It has a second catalytic activity as a uracil DNA glycosylase (140), functions as a cell cycle regulator (141), and has even been implicated in apoptosis (142).

Table 19. Number of proteins assigned to selected Panther families or subfamilies in H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A).

Panther family/subfamily* | H | F | W | Y | A

Neural structure, function, development
Ependymin | 1 | 0 | 0 | 0 | 0
Ion channels
  Acetylcholine receptor | 17 | 12 | 56 | 0 | 0
  Amiloride-sensitive/degenerin | 11 | 24 | 27 | 0 | 0
  CNG/EAG | 22 | 9 | 9 | 0 | 30
  IRK | 16 | 3 | 3 | 0 | 0
  ITP/ryanodine | 10 | 2 | 4 | 0 | 0
  Neurotransmitter-gated | 61 | 51 | 59 | 0 | 19
  P2X purinoceptor | 10 | 0 | 0 | 0 | 0
  TASK | 12 | 12 | 48 | 1 | 5
  Transient receptor | 15 | 3 | 3 | 1 | 0
  Voltage-gated Ca2+ alpha | 22 | 4 | 8 | 2 | 2
  Voltage-gated Ca2+ alpha-2 | 10 | 3 | 2 | 0 | 0
  Voltage-gated Ca2+ beta | 5 | 2 | 2 | 0 | 0
  Voltage-gated Ca2+ gamma | 1 | 0 | 0 | 0 | 0
  Voltage-gated K+ alpha | 33 | 5 | 11 | 0 | 0
  Voltage-gated KQT | 6 | 2 | 3 | 0 | 0
  Voltage-gated Na+ | 11 | 4 | 4 | 9 | 1
Myelin basic protein | 1 | 0 | 0 | 0 | 0
Myelin P0 | 5 | 0 | 0 | 0 | 0
Myelin proteolipid | 3 | 1 | 0 | 0 | 0
Myelin-oligodendrocyte glycoprotein | 1 | 0 | 0 | 0 | 0
Neuropilin | 2 | 0 | 0 | 0 | 0
Plexin | 9 | 2 | 0 | 0 | 0
Semaphorin | 22 | 6 | 2 | 0 | 0
Synaptotagmin | 10 | 3 | 3 | 0 | 0

Immune response
Defensin | 3 | 0 | 0 | 0 | 0
Cytokine† | 86 | 14 | 1 | 0 | 0
  GCSF | 1 | 0 | 0 | 0 | 0
  GMCSF | 1 | 0 | 0 | 0 | 0
  Intercrine alpha | 15 | 0 | 0 | 0 | 0
  Intercrine beta | 5 | 0 | 0 | 0 | 0
  Interferon | 8 | 0 | 0 | 0 | 0
  Interleukin | 26 | 1 | 1 | 0 | 0
  Leukemia inhibitory factor | 1 | 0 | 0 | 0 | 0
  MCSF | 1 | 0 | 0 | 0 | 0
  Peptidoglycan recognition protein | 2 | 13 | 0 | 0 | 0
  Pre-B cell enhancing factor | 1 | 0 | 0 | 0 | 0
  Small inducible cytokine A | 14 | 0 | 0 | 0 | 0
  Sl cytokine | 2 | 0 | 0 | 0 | 0
  TNF | 9 | 0 | 0 | 0 | 0
Cytokine receptor† | 62 | 1 | 0 | 0 | 0
  Bradykinin/C-C chemokine receptor | 7 | 0 | 0 | 0 | 0
  Fl cytokine receptor | 2 | 0 | 0 | 0 | 0
  Interferon receptor | 3 | 0 | 0 | 0 | 0
  Interleukin receptor | 32 | 0 | 0 | 0 | 0
  Leukocyte tyrosine kinase receptor | 3 | 0 | 0 | 0 | 0
  MCSF receptor | 1 | 0 | 0 | 0 | 0
  TNF receptor | 3 | 0 | 0 | 0 | 0
Immunoglobulin receptor† | 59 | 0 | 0 | 0 | 0
  T-cell receptor alpha chain | 16 | 0 | 0 | 0 | 0
  T-cell receptor beta chain | 15 | 0 | 0 | 0 | 0
  T-cell receptor gamma chain | 1 | 0 | 0 | 0 | 0
  T-cell receptor delta chain | 1 | 0 | 0 | 0 | 0
  Immunoglobulin FC receptor | 8 | 0 | 0 | 0 | 0
  Killer cell receptor | 16 | 0 | 0 | 0 | 0
  Polymeric-immunoglobulin receptor | 4 | 0 | 0 | 0 | 0

Translation. Another striking set of human expansions has occurred in certain families involved in the translational machinery. We identified 28 different ribosomal subunits that each have at least 10 copies in the genome; on average, for all ribosomal proteins there is about an 8- to 10-fold expansion in the number of genes relative to either the worm or fly. Retrotransposed pseudogenes may account for many of these expansions [see the discussion above and (143)]. Recent evidence suggests that a number of ribosomal proteins have secondary functions independent of their involvement in protein biosynthesis; for example, L13a and the related L7 subunits (36 copies in humans) have been shown to induce apoptosis (144).

There is also a four- to fivefold expansion in the elongation factor 1-alpha family (eEF1A; 56 human genes). Many of these expansions likely represent intronless paralogs that have presumably arisen from retrotransposition, and again there is evidence that many of these may be pseudogenes (145). However, a second form (eEF1A2) of this factor has been identified with tissue-specific expression in skeletal muscle and a complementary expression pattern to the ubiquitously expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing results in multiple transcripts from a single gene, and can therefore generate additional diversity in an organism’s protein complement. We have identified 269 genes for ribonucleoproteins. This represents over 2.5 times the number of ribonucleoprotein genes in the worm, two times that of the fly, and about the same as the 265 identified in the Arabidopsis genome. Whether the diversity of ribonucleoprotein genes in humans contributes to gene regulation at either the splicing or translational level is unknown.
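The fold expansions quoted in this section can be checked against the protein counts in Table 19. The short Python sketch below does that arithmetic for a few families. The counts are copied from the table; the dictionary layout and the function name `fold_expansion` are illustrative choices and not part of the original analysis.

```python
# Protein counts copied from Table 19 (H = human, F = fly, W = worm).
# The data layout and function name are illustrative assumptions.
TABLE_19 = {
    "Ribosomal proteins": {"H": 812, "F": 111, "W": 80},
    "EF-1alpha": {"H": 56, "F": 13, "W": 10},
    "Ribonucleoproteins": {"H": 269, "F": 135, "W": 104},
    "GAPDH": {"H": 46, "F": 3, "W": 4},
}


def fold_expansion(counts):
    """Return the human count divided by the fly and by the worm count."""
    return {
        "vs_fly": counts["H"] / counts["F"],
        "vs_worm": counts["H"] / counts["W"],
    }


for family, counts in TABLE_19.items():
    fold = fold_expansion(counts)
    print(f"{family}: {fold['vs_fly']:.1f}x fly, {fold['vs_worm']:.1f}x worm")
```

These ratios count predicted proteins only, so they do not distinguish functional genes from the retrotransposed pseudogenes discussed in the text.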

Posttranslational modifications. In this set of processes, the most prominent expansion is the transglutaminases, calcium-dependent enzymes that catalyze the cross-linking of proteins in cellular processes such as hemostasis and apoptosis (147). The vitamin K-dependent gamma carboxylase gene product acts on the GLA domain (missing in the fly and worm) found in coagulation factors, osteocalcin, and matrix GLA protein (148). Tyrosylprotein sulfotransferases participate in the posttranslational modification of proteins involved in inflammation and hemostasis, including coagulation factors and chemokine receptors (149). Although there is no significant numerical increase in the counts for domains involved in nuclear protein modification, there are a number of domain arrangements in the predicted human proteins that are not found in the other currently sequenced genomes. These include the tandem association of two histone deacetylase domains in HD6 with a ubiquitin finger domain, a feature lacking in the fly genome. An additional example is the co-occurrence of the important nuclear regulatory enzyme PARP (poly-ADP ribosyl transferase) domain fused to protein-interaction domains, BRCT and VWA, in humans.

Concluding remarks. There are several possible explanations for the differences in phenotypic complexity observed in humans when compared to the fly and worm. Some of these relate to the prominent differences in the immune system, hemostasis, neuronal, vascular, and cytoskeletal complexity. The finding that the human genome contains fewer genes than previously predicted might be compensated for by combinatorial diversity generated at the levels of protein architecture, transcriptional and translational control, posttranslational modification of proteins, or posttranscriptional regulation. Extensive domain shuffling to increase or alter combinatorial diversity can provide an exponential increase in the ability to mediate protein-protein interactions without dramatically increasing the absolute size of the protein complement (150). Evolution of apparently new (from the perspective of sequence analysis) protein domains and increasing regulatory complexity by domain accretion both quantitatively and qualitatively (recruitment of novel domains with preexisting ones) are two features that we observe in humans. Perhaps the best illustration of this trend is the C2H2 zinc finger-containing transcription factors, where we see expansion in the number of domains per protein, together with vertebrate-specific domains such as KRAB and SCAN. Recent reports on the prominent use of internal ribosomal entry sites in the human genome to regulate translation of specific classes of proteins suggest that this is an area that needs further research to identify the full extent of this process in the human genome (151). At the posttranslational level, although we provide examples of expansions of some protein families involved in these modifications, further experimental evidence is required to evaluate whether this is correlated with increased complexity in protein processing. Posttranscriptional processing and the extent of isoform generation in the human remain to be cataloged in their entirety. Given the conserved nature of the spliceosomal machinery, further analysis will be required to dissect regulation at this level.

Table 19 (Continued)

| Panther family/subfamily* | H | F | W | Y | A |
|---|---|---|---|---|---|
| MHC class I | 22 | 0 | 0 | 0 | 0 |
| MHC class II | 20 | 0 | 0 | 0 | 0 |
| Other immunoglobulin† | 114 | 0 | 0 | 0 | 0 |
| Toll receptor-related | 10 | 6 | 0 | 0 | 0 |
| Developmental and homeostatic regulators | | | | | |
| Signaling molecules† | | | | | |
| Calcitonin | 3 | 0 | 0 | 0 | 0 |
| Ephrin | 8 | 2 | 4 | 0 | 0 |
| FGF | 24 | 1 | 1 | 0 | 0 |
| Glucagon | 4 | 0 | 0 | 0 | 0 |
| Glycoprotein hormone beta chain | 2 | 0 | 0 | 0 | 0 |
| Insulin | 1 | 0 | 0 | 0 | 0 |
| Insulin-like hormone | 3 | 0 | 0 | 0 | 0 |
| Nerve growth factor | 3 | 0 | 0 | 0 | 0 |
| Neuregulin/heregulin | 6 | 0 | 0 | 0 | 0 |
| Neuropeptide Y | 4 | 0 | 0 | 0 | 0 |
| PDGF | 1 | 1 | 0 | 0 | 0 |
| Relaxin | 3 | 0 | 0 | 0 | 0 |
| Stanniocalcin | 2 | 0 | 0 | 0 | 0 |
| Thymopoietin | 2 | 0 | 1 | 0 | 0 |
| Thymosin beta | 4 | 2 | 0 | 0 | 0 |
| TGF-β | 29 | 6 | 4 | 0 | 0 |
| VEGF | 4 | 0 | 0 | 0 | 0 |
| Wnt | 18 | 6 | 5 | 0 | 0 |
| Receptors† | | | | | |
| Ephrin receptor | 12 | 2 | 1 | 0 | 0 |
| FGF receptor | 4 | 4 | 0 | 0 | 0 |
| Frizzled receptor | 12 | 6 | 5 | 0 | 0 |
| Parathyroid hormone receptor | 2 | 0 | 0 | 0 | 0 |
| VEGF receptor | 5 | 0 | 0 | 0 | 0 |
| BDNF/NT-3 nerve growth factor receptor | 4 | 0 | 0 | 0 | 0 |
| Kinases and phosphatases | | | | | |
| Dual-specificity protein phosphatase | 29 | 8 | 10 | 4 | 11 |
| S/T and dual-specificity protein kinase† | 395 | 198 | 315 | 114 | 1102 |
| S/T protein phosphatase | 15 | 19 | 51 | 13 | 29 |
| Y protein kinase† | 106 | 47 | 100 | 5 | 16 |
| Y protein phosphatase | 56 | 22 | 95 | 5 | 6 |
| Signal transduction | | | | | |
| ARF family | 55 | 29 | 27 | 12 | 45 |
| Cyclic nucleotide phosphodiesterase | 25 | 8 | 6 | 1 | 0 |
| G protein-coupled receptors†‡ | 616 | 146 | 284 | 0 | 1 |
| G-protein alpha | 27 | 10 | 22 | 2 | 5 |
| G-protein beta | 5 | 3 | 2 | 1 | 1 |
| G-protein gamma | 13 | 2 | 2 | 0 | 0 |
| Ras superfamily | 141 | 64 | 62 | 26 | 86 |
| G-protein modulators† | | | | | |
| ARF GTPase-activating | 20 | 8 | 9 | 5 | 15 |
| Neurofibromin | 7 | 2 | 0 | 2 | 0 |
| Ras GTPase-activating | 9 | 3 | 8 | 1 | 0 |
| Tuberin | 7 | 3 | 2 | 0 | 0 |
| Vav proto-oncogene family | 35 | 15 | 13 | 3 | 0 |

8 Conclusions

8.1 The whole-genome sequencing approach versus BAC by BAC

Experience in applying the whole-genome shotgun sequencing approach to a diverse group of organisms with a wide range of genome sizes and repeat content allows us to assess its strengths and weaknesses. With the success of the method for a large number of microbial genomes, Drosophila, and now the human, there can be no doubt concerning the utility of this method. The large number of microbial genomes that have been sequenced by this method (15, 80, 152) demonstrates that megabase-sized genomes can be sequenced efficiently without any input other than the de novo mate-paired sequences. With more complex genomes like those of Drosophila or human, map information, in the form of well-ordered markers, has been critical for long-range ordering of scaffolds. For joining scaffolds into chromosomes, the quality of the map (in terms of the order of the markers) is more important than the number of markers per se. Although this mapping could have been performed concurrently with sequencing, the prior existence of mapping data was beneficial. During the sequencing of the A. thaliana genome, sequencing of individual BAC clones permitted extension of the sequence well into centromeric regions and allowed high-quality resolution of complex repeat regions. Likewise, in Drosophila, the BAC physical map was most useful in regions near the highly repetitive centromeres and telomeres. WGA has been found to deliver excellent-quality reconstructions of the unique regions of the genome. As the genome size, and more importantly the repetitive content, increases, the WGA approach delivers less of the repetitive sequence.

Table 19 (Continued)

| Panther family/subfamily* | H | F | W | Y | A |
|---|---|---|---|---|---|
| Transcription factors/chromatin organization | | | | | |
| C2H2 zinc finger-containing† | 607 | 232 | 79 | 28 | 8 |
| COE | 7 | 1 | 1 | 0 | 0 |
| CREB | 7 | 1 | 2 | 0 | 0 |
| ETS-related | 25 | 8 | 10 | 0 | 0 |
| Forkhead-related | 34 | 19 | 15 | 4 | 0 |
| FOS | 8 | 2 | 1 | 0 | 0 |
| Groucho | 13 | 2 | 1 | 0 | 0 |
| Histone H1 | 5 | 0 | 1 | 0 | 0 |
| Histone H2A | 24 | 1 | 17 | 3 | 13 |
| Histone H2B | 21 | 1 | 17 | 2 | 12 |
| Histone H3 | 28 | 2 | 24 | 2 | 16 |
| Histone H4 | 9 | 1 | 16 | 1 | 8 |
| Homeotic† | 168 | 104 | 74 | 4 | 78 |
| ABD-B | 5 | 0 | 0 | 0 | 0 |
| Bithoraxoid | 1 | 8 | 1 | 0 | 0 |
| Iroquois class | 7 | 3 | 1 | 0 | 0 |
| Distal-less | 5 | 2 | 1 | 0 | 0 |
| Engrailed | 2 | 2 | 1 | 0 | 0 |
| LIM-containing | 17 | 8 | 3 | 0 | 0 |
| MEIS/KNOX class | 9 | 4 | 4 | 2 | 26 |
| NK-3/NK-2 class | 9 | 4 | 5 | 0 | 0 |
| Paired box | 38 | 28 | 23 | 0 | 2 |
| Six | 5 | 3 | 4 | 0 | 0 |
| Leucine zipper | 6 | 0 | 0 | 0 | 0 |
| Nuclear hormone receptor† | 59 | 25 | 183 | 1 | 4 |
| Pou-related | 15 | 5 | 4 | 1 | 0 |
| Runt-related | 3 | 4 | 2 | 0 | 0 |
| ECM adhesion | | | | | |
| Cadherin | 113 | 17 | 16 | 0 | 0 |
| Claudin | 20 | 0 | 0 | 0 | 0 |
| Complement receptor-related | 22 | 8 | 6 | 0 | 0 |
| Connexin | 14 | 0 | 0 | 0 | 0 |
| Galectin | 12 | 5 | 22 | 0 | 0 |
| Glypican | 13 | 2 | 1 | 0 | 0 |
| ICAM | 6 | 0 | 0 | 0 | 0 |
| Integrin alpha | 24 | 7 | 4 | 0 | 1 |
| Integrin beta | 9 | 2 | 2 | 0 | 0 |
| LDL receptor family | 26 | 19 | 20 | 0 | 2 |
| Proteoglycans | 22 | 9 | 7 | 0 | 5 |
| Apoptosis | | | | | |
| Bcl-2 | 12 | 1 | 0 | 0 | 0 |
| Calpain | 22 | 4 | 11 | 1 | 3 |
| Calpain inhibitor | 4 | 0 | 0 | 0 | 1 |
| Caspase | 13 | 7 | 3 | 0 | 0 |
| Hemostasis | | | | | |
| ADAM/ADAMTS | 51 | 9 | 12 | 0 | 0 |
| Fibronectin | 3 | 0 | 0 | 0 | 0 |
| Globin | 10 | 2 | 3 | 0 | 3 |
| Matrix metalloprotease | 19 | 2 | 7 | 0 | 3 |
| Serum amyloid A | 4 | 0 | 0 | 0 | 0 |
| Serum amyloid P (subfamily of Pentaxin) | 2 | 0 | 0 | 0 | 0 |
| Serum paraoxonase/arylesterase | 4 | 0 | 3 | 0 | 0 |
| Serum albumin | 4 | 0 | 0 | 0 | 0 |
| Transglutaminase | 10 | 1 | 0 | 0 | 0 |
| Other enzymes | | | | | |
| Cytochrome p450 | 60 | 89 | 83 | 3 | 256 |
| GAPDH | 46 | 3 | 4 | 3 | 8 |
| Heparan sulfotransferase | 11 | 4 | 2 | 0 | 0 |
| Splicing and translation | | | | | |
| EF-1alpha | 56 | 13 | 10 | 6 | 13 |
| Ribonucleoproteins† | 269 | 135 | 104 | 60 | 265 |
| Ribosomal proteins† | 812 | 111 | 80 | 117 | 256 |

*The table lists Panther families or subfamilies relevant to the text that either (i) are not specifically represented by Pfam (Table 18) or (ii) differ in counts from the corresponding Pfam models.
†This class represents a number of different families in the same Panther molecular function subcategory.
‡This count includes only rhodopsin-class, secretin-class, and metabotropic glutamate-class GPCRs.

The cost and overall efficiency of clone-by-clone approaches make them difficult to justify as a stand-alone strategy for future large-scale genome-sequencing projects. Specific applications of BAC-based or other clone mapping and sequencing strategies to resolve ambiguities in sequence assembly that cannot be efficiently resolved with computational approaches alone are clearly worth exploring. Hybrid approaches to whole-genome sequencing will only work if there is sufficient coverage in both the whole-genome shotgun phase and the BAC clone sequencing phase. Our experience with human genome assembly suggests that this will require at least 3× coverage of both whole-genome and BAC shotgun sequence data.
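To give a feel for what a given depth of shotgun coverage buys, the following Python sketch evaluates the idealized Lander-Waterman (Poisson) expectation that a fraction 1 - e^(-c) of bases is hit by at least one read at c-fold coverage. This model is an assumption introduced here for illustration: it ignores repeats, cloning bias, and read-length effects, and it is not the assembly procedure used in this work. The function name is likewise hypothetical.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read.

    Idealized Poisson (Lander-Waterman) model: reads land uniformly at
    random, so the chance a base is missed at c-fold coverage is exp(-c).
    """
    return 1.0 - math.exp(-coverage)


for c in (1, 3, 5, 8):
    print(f"{c}x coverage -> {expected_fraction_covered(c):.4f} of bases covered")
```

Under these assumptions, threefold coverage leaves roughly 5% of bases unsampled in each data set, which is one way to see why combining two independently sampled sources helps.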

8.2 The low gene number in humans

We have sequenced and assembled ~95% of the euchromatic sequence of H. sapiens and used a new automated gene prediction method to produce a preliminary catalog of the human genes. This has provided a major surprise: We have found far fewer genes (26,000 to 38,000) than the earlier molecular predictions (50,000 to over 140,000). Whatever the reasons for this current disparity, only detailed annotation, comparative genomics (particularly using the Mus musculus genome), and careful molecular dissection of complex phenotypes will clarify this critical issue of the basic “parts list” of our genome. Certainly, the analysis is still incomplete and considerable refinement will occur in the years to come as the precise structure of each transcription unit is evaluated. A good place to start is to determine why the gene estimates derived from EST data are so discordant with our predictions. It is likely that the following contribute to an inflated gene number derived from ESTs: the variable lengths of 3′- and 5′-untranslated leaders and trailers; the little-understood vagaries of RNA processing that often leave intronic regions in an unspliced condition; the finding that nearly 40% of human genes are alternatively spliced (153); and finally, the unsolved technical problems in EST library construction where contamination from heterogeneous nuclear RNA and genomic DNA is not uncommon. Of course, it is possible that there are genes that remain unpredicted owing to the absence of EST or protein data to support them, although our use of mouse genome data for predicting genes should limit this number. As was true at the beginning of genome sequencing, ultimately it will be necessary to measure mRNA in specific cell types to demonstrate the presence of a gene.

J. B. S. Haldane speculated in 1937 that a population of organisms might have to pay a price for the number of genes it can possibly carry. He theorized that when the number of genes becomes too large, each zygote carries so many new deleterious mutations that the population simply cannot maintain itself. On the basis of this premise, and on the basis of available mutation rates and x-ray-induced mutations at specific loci, Muller, in 1967 (154), calculated that the mammalian genome would contain a maximum of not much more than 30,000 genes (155). An estimate of 30,000 gene loci for humans was also arrived at by Crow and Kimura (156). Muller’s estimate for D. melanogaster was 10,000 genes, compared to 13,000 derived by annotation of the fly genome (26, 27). These arguments for the theoretical maximum gene number were based on simplified ideas of genetic load: that all genes have a certain low rate of mutation to a deleterious state. However, it is clear that many mouse, fly, worm, and yeast knockout mutations lead to almost no discernible phenotypic perturbations.
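The flavor of the genetic-load argument can be sketched numerically. Under the classical simplification that every locus mutates to a deleterious state at the same rate u and that fitness effects multiply across loci, equilibrium mean fitness is approximately e^(-U), where U = 2nu is the expected number of new deleterious mutations per diploid zygote for n genes. The Python sketch below uses that textbook approximation; the per-locus rate of 1e-5 is an illustrative assumption, not a value taken from this paper or from Muller's calculation, and the function name is hypothetical.

```python
import math

# Illustrative per-locus, per-generation deleterious mutation rate.
# This value is an assumption for the example, not a figure from the paper.
ASSUMED_RATE_PER_LOCUS = 1e-5


def mean_fitness(n_genes, u_per_locus):
    """Approximate equilibrium mean fitness exp(-U), with U = 2 * n * u.

    Assumes multiplicative fitness effects and the same deleterious
    mutation rate at every locus (the simplified genetic-load model).
    """
    genomic_rate = 2 * n_genes * u_per_locus  # new mutations per diploid zygote
    return math.exp(-genomic_rate)


for n in (10_000, 30_000, 140_000):
    w = mean_fitness(n, ASSUMED_RATE_PER_LOCUS)
    print(f"{n:>7} genes -> mean fitness {w:.3f}")
```

With these assumed numbers, mean fitness falls from about 0.55 at 30,000 genes to about 0.06 at 140,000, which illustrates why load arguments pointed toward a modest gene count. As the text notes, the premise that every gene contributes equally to load is itself questionable.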

The modest number of human genes means that we must look elsewhere for the mechanisms that generate the complexities inherent in human development and the sophisticated signaling systems that maintain homeostasis. There are a large number of ways in which the functions of individual genes and gene products are regulated. The degree of “openness” of chromatin structure and hence transcriptional activity is regulated by protein complexes that involve histone and DNA enzymatic modifications. We enumerate many of the proteins that are likely involved in nuclear regulation in Table 19. The location, timing, and quantity of transcription are intimately linked to nuclear signal transduction events as well as to the tissue-specific expression of many of these proteins. Equally important are regulatory DNA elements that include insulators, repeats, and endogenous viruses (157); methylation of CpG islands in imprinting (158); and promoter-enhancer and intronic regions that modulate transcription. The spliceosomal machinery consists of multisubunit proteins (Table 19) as well as structural and catalytic RNA elements (159) that regulate transcript structure through alternative start and termination sites and splicing. Hence, there is a need to study different classes of RNA molecules (160) such as small nucleolar RNAs, antisense riboregulator RNA, RNA involved in X-dosage compensation, and other structural RNAs to appreciate their precise role in regulating gene expression. The phenomenon of RNA editing in which coding changes occur directly at the level of mRNA is of clinical and biological relevance (161). Finally, examples of translational control include internal ribosomal entry sites that are found in proteins involved in cell cycle regulation and apoptosis (162). At the protein level, minor alterations in the nature of protein-protein interactions, protein modifications, and localization can have dramatic effects on cellular physiology (163). This dynamic system therefore has many ways to modulate activity, which suggests that definition of complex systems by analysis of single genes is unlikely to be entirely successful.

In situ studies have shown that the human genome is asymmetrically populated with G+C content, CpG islands, and genes (68). However, the genes are not distributed quite as unequally as had been predicted (Table 9) (69). The most G+C-rich fraction of the genome, H3 isochores, constitutes more of the genome than previously thought (about 9%), and is the most gene-dense fraction, but contains only 25% of the genes, rather than the predicted ~40%. The low G+C L isochores make up 65% of the genome, and 48% of the genes. This inhomogeneity, the net result of millions of years of mammalian gene duplication, has been described as the “desertification” of the vertebrate genome (71). Why are there clustered regions of high and low gene density, and are these accidents of history or driven by selection and evolution? If these deserts are dispensable, it ought to be possible to find mammalian genomes that are far smaller in size than the human genome. Indeed, many species of bats have genome sizes that are much smaller than that of humans; for example, Miniopterus, a species of Italian bat, has a genome size that is only 50% that of humans (164). Similarly, Muntiacus, a species of Asian barking deer, has a genome size that is ~70% that of humans.
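The isochore percentages above translate directly into relative gene densities. The Python sketch below divides each compartment's share of genes by its share of the genome, so a value of 1.0 would mean genes are spread evenly. The H3 and L figures are the ones quoted in the text; the "other" row is derived here by subtraction from 100% and is an inference, not a number reported in the paper. Names in the code are illustrative.

```python
# Percent of genome and percent of genes per isochore class, from the text.
ISOCHORES = {
    "H3": {"genome": 9, "genes": 25},
    "L": {"genome": 65, "genes": 48},
}

# Remainder derived by subtraction (an inference, not a reported value).
ISOCHORES["other"] = {
    "genome": 100 - 9 - 65,
    "genes": 100 - 25 - 48,
}


def relative_density(share):
    """Share of genes divided by share of genome (1.0 = uniform density)."""
    return share["genes"] / share["genome"]


for name, share in ISOCHORES.items():
    print(f"{name}: {relative_density(share):.2f}x the genome-wide average")
```

On these figures H3 isochores carry nearly three times the average gene density, while L isochores sit at roughly three-quarters of it.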

8.3 Human DNA sequence variation and its distribution across the genome

This is the first eukaryotic genome in which a nearly uniform ascertainment of polymorphism has been completed. Although we have identified and mapped more than 3 million SNPs, this by no means implies that the task of finding and cataloging SNPs is complete. These represent only a fraction of the SNPs present in the human population as a whole. Nevertheless, this first glimpse at genome-wide variation has revealed strong inhomogeneities in the distribution of SNPs across the genome. Polymorphism in DNA carries with it a snapshot of the past operation of population genetic forces, including mutation, migration, selection, and genetic drift. The availability of a dense array of SNPs will allow questions related to each of these factors to be addressed on a genome-wide basis. SNP studies can establish the range of haplotypes present in subjects of different ethnogeographic origins, providing insights into population history and migration patterns. Although such studies have suggested that modern human lineages derive from Africa, many important questions regarding human origins remain unanswered, and more analyses using detailed SNP maps will be needed to settle these controversies. In addition to providing evidence for population expansions, migration, and admixture, SNPs can serve as markers for the extent of evolutionary constraint acting on particular genes. The correlation between patterns of intraspecies and interspecies genetic variation may prove to be especially informative to identify sites of reduced genetic diversity that may mark loci where sequence variations are not tolerated.

The remarkable heterogeneity in SNP density implies that there are a variety of forces acting on polymorphism: sparse regions may have lower SNP density because the mutation rate is lower, because most of those regions have a lower fraction of mutations that are tolerated, or because recent strong selection in favor of a newly arisen allele “swept” the linked variation out of the population (165). The effect of random genetic drift also varies widely across the genome. The nonrecombining portion of the Y chromosome faces the strongest pressure from random drift because there are roughly one-quarter as many Y chromosomes in the population as there are autosomal chromosomes, and the level of polymorphism on the Y is correspondingly less. Similarly, the X chromosome has a smaller effective population size than the autosomes, and its nucleotide diversity is also reduced. But even across a single autosome, the effective population size can vary because the density of deleterious mutations may vary. Regions of high density of deleterious mutations will see a greater rate of elimination by selection, and the effective population size will be smaller (166). As a result, the density of even completely neutral SNPs will be lower in such regions. There is a large literature on the association between SNP density and local recombination rates in Drosophila, and it remains an important task to assess the strength of this association in the human genome, because of its impact on the design of local SNP densities for disease-association studies. It also remains an important task to validate SNPs on a genomic scale in order to assess the degree of heterogeneity among geographic and ethnic populations.
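The chromosome-counting argument for drift can be made explicit. In a population with equal numbers of breeding females and males, each breeding pair carries four copies of every autosome, three X chromosomes, and one Y, which gives relative effective sizes of 1, 3/4, and 1/4. The Python sketch below encodes this and scales the neutral expectation for nucleotide diversity (4 N_e mu for autosomes) accordingly. The equal sex ratio, the equal mutation rate across chromosomes, and the numerical values of N_e and mu are all simplifying assumptions made for illustration and are not estimates from this paper.

```python
from fractions import Fraction

# Chromosome copies per breeding pair (one female, one male),
# assuming an equal sex ratio among breeders.
COPIES_PER_PAIR = {"autosome": 4, "X": 3, "Y": 1}


def relative_effective_size(chrom):
    """Effective size of a chromosome class relative to the autosomes."""
    return Fraction(COPIES_PER_PAIR[chrom], COPIES_PER_PAIR["autosome"])


def expected_diversity(n_e_autosomal, mu, chrom):
    """Neutral expectation 4 * Ne * mu, scaled by relative effective size."""
    return 4 * n_e_autosomal * mu * float(relative_effective_size(chrom))


# Illustrative values only (assumptions, not taken from the paper).
N_E = 10_000
MU = 2.5e-8

for chrom in COPIES_PER_PAIR:
    pi = expected_diversity(N_E, MU, chrom)
    print(f"{chrom}: relative Ne {relative_effective_size(chrom)}, diversity {pi:.2e}")
```

Selection at linked sites, sex-biased mutation, and unequal variance in reproductive success would all shift these ratios, which is part of why observed diversity varies even within a single autosome.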

8.4 Genome complexity

We will soon be in a position to move away from the cataloging of individual components of the system, and beyond the simplistic notions of “this binds to that, which then docks on this, and then the complex moves there. . . .” (167) to the exciting area of network perturbations, nonlinear responses and thresholds, and their pivotal role in human diseases.

The enumeration of other “parts lists” reveals that in organisms with complex nervous systems, neither gene number, neuron number, nor number of cell types correlates in any meaningful manner with even simplistic measures of structural or behavioral complexity. Nor would they be expected to; this is the realm of nonlinearities and epigenesis (168). The 520 million neurons of the common octopus exceed the neuronal number in the brain of a mouse by an order of magnitude. It is apparent from a comparison of genomic data on the mouse and human, and from comparative mammalian neuroanatomy (169), that the morphological and behavioral diversity found in mammals is underpinned by a similar gene repertoire and similar neuroanatomies. For example, when one compares a pygmy marmoset (which is only 4 inches tall and weighs about 6 ounces) to a chimpanzee, the brain volume of this minute primate is found to be only about 1.5 cm³, two orders of magnitude less than that of a chimp and three orders less than that of humans. Yet the neuroanatomies of all three brains are strikingly similar, and the behavioral characteristics of the pygmy marmoset are little different from those of chimpanzees. Between humans and chimpanzees, the gene number, gene structures and functions, chromosomal and genomic organizations, and cell types and neuroanatomies are almost indistinguishable, yet the developmental modifications that predisposed human lineages to cortical expansion and development of the larynx, giving rise to language, culminated in a massive singularity that by even the simplest of criteria made humans more complex in a behavioral sense.

Simple examination of the number of neurons, cell types, or genes or of the genome size does not alone account for the differences in complexity that we observe. Rather, it is the interactions within and among these sets that result in such great variation. In addition, it is possible that there are “special cases” of regulatory gene networks that have a disproportionate effect on the overall system. We have presented several examples of “regulatory genes” that are significantly increased in the human genome compared with the fly and worm. These include extracellular ligands and their cognate receptors (e.g., wnt, frizzled, TGF-β, ephrins, and connexins), as well as nuclear regulators (e.g., the KRAB and homeodomain transcription factor families), where a few proteins control broad developmental processes. The answers to these “complexities” perhaps lie in these expanded gene families and differences in the regulatory control of ancient genes, proteins, pathways, and cells.

8.5 Beyond single components

While few would disagree with the intuitive conclusion that Einstein’s brain was more complex than that of Drosophila, closer comparisons such as whether the set of predicted human proteins is more complex than the protein set of Drosophila, and if so, to what degree, are not straightforward, since protein, protein domain, or protein-protein interaction measures do not capture context-dependent interactions that underpin the dynamics underlying phenotype.

Currently, there are more than 30 different mathematical descriptions of complexity (170). However, we have yet to understand the mathematical dependency relating the number of genes with organism complexity. One pragmatic approach to the analysis of biological systems, which are composed of nonidentical elements (proteins, protein complexes, interacting cell types, and interacting neuronal populations), is through graph theory (171). The elements of the system can be represented by the vertices of complex topographies, with the edges representing the interactions between them. Examination of large networks reveals that they can self-organize, but more important, they can be particularly robust. This robustness is not due to redundancy, but is a property of inhomogeneously wired networks. The error tolerance of such networks comes with a price; they are vulnerable to the selection or removal of a few nodes that contribute disproportionately to network stability. Gene knockouts provide an illustration. Some knockouts may have minor effects, whereas others have catastrophic effects on the system. In the case of vimentin, a supposedly critical component of the cytoplasmic intermediate filament network of mammals, the knockout of the gene in mice reveals them to be reproductively normal, with no obvious phenotypic effects (172), and yet the usually conspicuous vimentin network is completely absent. On the other hand, ~30% of knockouts in Drosophila and mice correspond to critical nodes whose reduction in gene product, or total elimination, causes the network to crash most of the time, although even in some of these cases, phenotypic normalcy ensues, given the appropriate genetic background. Thus, there are no “good” genes or “bad” genes, but only networks that exist at various levels and at different connectivities, and at different states of sensitivity to perturbation. Sophisticated mathematical analysis needs to be constantly evaluated against hard biological data sets that specifically address network dynamics. Nowhere is this more critical than in attempts to come to grips with “complexity,” particularly because deconvoluting and correcting complex networks that have undergone perturbation, and have resulted in human diseases, is the most significant challenge now facing us.
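The contrast between error tolerance and vulnerability to the loss of a few key nodes can be illustrated with a toy graph. The Python sketch below builds a deliberately inhomogeneous network (a handful of interconnected hubs, each with many singly connected partners) and compares the size of the largest connected component after removing five peripheral nodes versus the five best-connected nodes. The topology, node names, and function names are invented for illustration; this is not a model of any real protein or gene network.

```python
from collections import deque


def build_hub_network(n_hubs=5, leaves_per_hub=20):
    """Toy inhomogeneous network: hubs form a clique, each hub has its own leaves."""
    hubs = [f"hub{i}" for i in range(n_hubs)]
    adj = {h: set() for h in hubs}
    for i, a in enumerate(hubs):
        for b in hubs[i + 1:]:
            adj[a].add(b)
            adj[b].add(a)
    for h in hubs:
        for j in range(leaves_per_hub):
            leaf = f"{h}_leaf{j}"
            adj[leaf] = {h}
            adj[h].add(leaf)
    return adj


def largest_component(adj, removed=()):
    """Size of the largest connected component once `removed` nodes are deleted."""
    removed = set(removed)
    seen = set()
    best = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best


network = build_hub_network()
by_degree = sorted(network, key=lambda n: len(network[n]))
leaf_nodes = by_degree[:5]   # five least-connected nodes
hub_nodes = by_degree[-5:]   # five most-connected nodes

print("intact:", largest_component(network))
print("5 peripheral nodes removed:", largest_component(network, leaf_nodes))
print("5 hubs removed:", largest_component(network, hub_nodes))
```

In this toy case, losing five peripheral nodes leaves the rest of the network connected, whereas losing the five hubs shatters it into isolated nodes, which mirrors the knockout contrast described above in a highly simplified form.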

It has been predicted for the last 15 years that complete sequencing of the human genome would open up new strategies for human biological research and would have a major impact on medicine, and through medicine and public health, on society. Effects on biomedical research are already being felt. This assembly of the human genome sequence is but a first, hesitant step on a long and exciting journey toward understanding the role of the genome in human biology. It has been possible only because of innovations in instrumentation and software that have allowed automation of almost every step of the process from DNA preparation to annotation. The next steps are clear: We must define the complexity that ensues when this relatively modest set of about 30,000 genes is expressed. The sequence provides the framework upon which all the genetics, biochemistry, physiology, and ultimately phenotype depend. It provides the boundaries for scientific inquiry. The sequence is only the first level of understanding of the genome. All genes and their control elements must be identified; their functions, in concert as well as in isolation, defined; their sequence variation worldwide described; and the relation between genome variation and specific phenotypic characteristics determined. Now we know what we have to explain.

Another paramount challenge awaits: public discussion of this information and its potential for improvement of personal health. Many diverse sources of data have shown that any two individuals are more than 99.9% identical in sequence, which means that all the glorious differences among individuals in our species that can be attributed to genes fall in a mere 0.1% of the sequence. There are two fallacies to be avoided: determinism, the idea that all characteristics of the person are “hard-wired” by the genome; and reductionism, the view that with complete knowledge of the human genome sequence, it is only a matter of time before our understanding of gene functions and interactions will provide a complete causal description of human variability. The real challenge of human biology, beyond the task of finding out how genes orchestrate the construction and maintenance of the miraculous mechanism of our bodies, will lie ahead as we seek to explain how our minds have come to organize thoughts sufficiently well to investigate our own existence.

References and Notes

1. R. L. Sinsheimer, Genomics 5, 954 (1989); U.S. Department of Energy, Office of Health and Environmental Research, Sequencing the Human Genome: Summary Report of the Santa Fe Workshop, Santa Fe, NM, 3 to 4 March 1986 (Los Alamos National Laboratory, Los Alamos, NM, 1986).

2. R. Cook-Deegan, The Gene Wars: Science, Politics, and the Human Genome (Norton, New York, 1996).

3. F. Sanger et al., Nature 265, 687 (1977).

4. P. H. Seeburg et al., Trans. Assoc. Am. Physicians 90, 109 (1977).

5. E. C. Strauss, J. A. Kobori, G. Siu, L. E. Hood, Anal. Biochem. 154, 353 (1986).

6. J. Gocayne et al., Proc. Natl. Acad. Sci. U.S.A. 84, 8296 (1987).

7. A. Martin-Gallardo et al., DNA Sequence 3, 237 (1992); W. R. McCombie et al., Nature Genet. 1, 348 (1992); M. A. Jensen et al., DNA Sequence 1, 233 (1991).

8. M. D. Adams et al., Science 252, 1651 (1991).

9. M. D. Adams et al., Nature 355, 632 (1992); M. D. Adams, A. R. Kerlavage, C. Fields, J. C. Venter, Nature Genet. 4, 256 (1993); M. D. Adams, M. B. Soares, A. R. Kerlavage, C. Fields, J. C. Venter, Nature Genet. 4, 373 (1993); M. H. Polymeropoulos et al., Nature Genet. 4, 381 (1993); M. Marra et al., Nature Genet. 21, 191 (1999).

10. M. D. Adams et al., Nature 377, 3 (1995); O. White et al., Nucleic Acids Res. 21, 3829 (1993).

11. F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, G. B. Petersen, J. Mol. Biol. 162, 729 (1982).

12. B. W. J. Mahy, J. J. Esposito, J. C. Venter, Am. Soc. Microbiol. News 57, 577 (1991).

13. R. D. Fleischmann et al., Science 269, 496 (1995).

14. C. M. Fraser et al., Science 270, 397 (1995).

15. C. J. Bult et al., Science 273, 1058 (1996); J. F. Tomb et al., Nature 388, 539 (1997); H. P. Klenk et al., Nature 390, 364 (1997).

16. J. C. Venter, H. O. Smith, L. Hood, Nature 381, 364 (1996).

17. H. Schmitt et al., Genomics 33, 9 (1996).

18. S. Zhao et al., Genomics 63, 321 (2000).

19. X. Lin et al., Nature 402, 761 (1999).

20. J. L. Weber, E. W. Myers, Genome Res. 7, 401 (1997).

21. P. Green, Genome Res. 7, 410 (1997).

22. E. Pennisi, Science 280, 1185 (1998).

23. J. C. Venter et al., Science 280, 1540 (1998).

24. M. D. Adams et al., Nature 368, 474 (1994).

25. E. Marshall, E. Pennisi, Science 280, 994 (1998).

26. M. D. Adams et al., Science 287, 2185 (2000).

27. G. M. Rubin et al., Science 287, 2204 (2000).

28. E. W. Myers et al., Science 287, 2196 (2000).

29. F. S. Collins et al., Science 282, 682 (1998).

30. International Human Genome Sequencing Consortium (2001), Nature 409, 860 (2001).

31. Institutional review board: P. Calabresi (chairman), H. P. Freeman, C. McCarthy, A. L. Caplan, G. D. Rogell, J. Karp, M. K. Evans, B. Margus, C. L. Carter, R. A. Millman, S. Broder.

32. Eligibility criteria for participation in the study were as follows: prospective donors had to be 21 years of age or older, not pregnant, and capable of giving an informed consent. Donors were asked to self-define their ethnic backgrounds. Standard blood bank screens (screening for HIV, hepatitis viruses, and so forth) were performed on all samples at the clinical laboratory prior to DNA extraction in the Celera laboratory. All samples that tested positive for transmissible viruses were ineligible and were discarded. Karyotype analysis was performed on peripheral blood lymphocytes from all samples selected for sequencing; all were normal. A two-staged consent process for prospective donors was employed. The first stage of the consent process provided information about the genome project, procedures, and risks and benefits of participating. The second stage of the consent process involved answering follow-up questions and signing consent forms, and was conducted about 48 hours after the first.

33. DNA was isolated from blood (173) or sperm. For sperm, a washed pellet (100 µl) was lysed in a suspension (1 ml) containing 0.1 M NaCl, 10 mM tris-Cl–20 mM EDTA (pH 8), 1% SDS, 1 mg proteinase K, and 10 mM dithiothreitol for 1 hour at 37°C. The lysate was extracted with aqueous phenol and with phenol/chloroform. The DNA was ethanol precipitated and dissolved in 1 ml TE buffer. To make genomic libraries, DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point agarose. After ligation to Bst XI adapters (Invitrogen, catalog no. N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3′-CACA overhangs, were inserted into Bst XI-linearized plasmid vector with 3′-TGTG overhangs. Libraries with three different average sizes of inserts were constructed: 2, 10, and 50 kbp. The 2-kbp fragments were cloned in a high-copy pUC18 derivative. The 10- and 50-kbp fragments were cloned in a medium-copy pBR322 derivative. The 2- and 10-kbp libraries yielded uniform-sized large colonies on plating. However, the 50-kbp libraries produced many small colonies and inserts were unstable. To remedy this, the 50-kbp libraries were digested with Bgl II, which does not cleave the vector, but generally cleaved several times within the 50-kbp insert. A 1264-bp Bam HI kanamycin resistance cassette (purified from pUCK4; Amersham Pharmacia, catalog no. 27-4958-01) was added and ligation was carried out at 37°C in the continual presence of Bgl II. As Bgl II–Bgl II ligations occurred, they were continually cleaved, whereas Bam HI–Bgl II ligations were not cleaved. A high yield of internally deleted circular library molecules was obtained in which the residual insert ends were separated by the kanamycin cassette DNA. The internally deleted libraries, when plated on agar containing ampicillin (50 µg/ml), carbenicillin (50 µg/ml), and kanamycin (15 µg/ml), produced relatively uniform large colonies. The resulting clones could be prepared for sequencing using the same procedures as clones from the 10-kbp libraries.

34. Transformed cells were plated on agar diffusion plates prepared with a fresh top layer containing no antibiotic poured on top of a previously set bottom layer containing excess antibiotic, to achieve the correct final concentration. This method of plating permitted the cells to develop antibiotic resistance before being exposed to antibiotic without the potential clone bias that can be introduced through liquid outgrowth protocols. After colonies had grown, QBot (Genetix, UK) automated colony-picking robots were used to pick colonies meeting stringent size and shape criteria and to inoculate 384-well microtiter plates containing liquid growth medium. Liquid cultures were incubated overnight, with shaking, and were scored for growth before passing to template preparation. Template DNA was extracted from liquid bacterial culture using a procedure based upon the alkaline lysis miniprep method (173) adapted for high throughput processing in 384-well microtiter plates. Bacterial cells were lysed; cell debris was removed by centrifugation; and plasmid DNA was recovered by isopropanol precipitation and resuspended in 10 mM tris-HCl buffer. Reagent dispensing operations were accomplished using Titertek MAP 8 liquid dispensing systems. Plate-to-plate liquid transfers were performed using Tomtec Quadra 384 Model 320 pipetting robots. All plates were tracked throughout processing by unique plate barcodes. Mated sequencing reads from opposite ends of each clone insert were obtained by preparing two 384-well cycle sequencing reaction plates from each plate of plasmid template DNA using ABI-PRISM BigDye Terminator chemistry (Applied Biosystems) and standard M13 forward and reverse primers. Sequencing reactions were prepared using the Tomtec Quadra 384-320 pipetting robot. Parent-child plate relationships and, by extension, forward-reverse sequence mate pairs were established by automated plate barcode reading by the onboard barcode reader and were recorded by direct LIMS communication. Sequencing reaction products were purified by alcohol precipitation and were dried, sealed, and stored at 4°C in the dark until needed for sequencing, at which time the reaction products were resuspended in deionized formamide and sealed immediately to prevent degradation. All sequence data were generated using a single sequencing platform, the ABI PRISM 3700 DNA Analyzer. Sample sheets were created at load time using a Java-based application that facilitates barcode scanning of the sequencing plate barcode, retrieves sample information from the central LIMS, and reserves unique trace identifiers. The application permitted a single sample sheet file in the linking directory and deleted previously created sample sheet files immediately upon scanning of a

sample plate barcode, thus enhancing sample sheet-to-plate associations.

35. F. Sanger, S. Nicklen, A. R. Coulson, Proc. Natl. Acad. Sci. U.S.A. 74, 5463 (1977); J. M. Prober et al., Science 238, 336 (1987).

36. Celera’s computing environment is based on Compaq Computer Corporation’s Alpha system technology running the Tru64 Unix operating system. Celera uses these Alphas as Data Servers and as nodes in a Virtual Compute Farm (VCF), all of which are connected to a fully switched network operating at Fast Ethernet speed (for the VCF) and gigabit Ethernet speed (for data servers). Load balancing and scheduling software manages the submission and execution of jobs, based on central processing unit (CPU) speed, memory requirements, and priority. The Virtual Compute Farm is composed of 440 Alpha CPUs, which include model EV6 running at a clock speed of 400 MHz and EV67 running at 667 MHz. Available memory on these systems ranges from 2 GB to 8 GB. The VCF is used to manage trace file processing and annotation. Genome assembly was performed on a GS 160 running 16 EV67s (667 MHz) and 64 GB of memory, and 10 ES40s running 4 EV6s (500 MHz) and 32 GB of memory. A total of 100 terabytes of physical disk storage was included in a Storage Area Network that was available to systems across the environment. To ensure high availability, file and database servers were configured as 4-node Alpha TruClusters, so that services would fail over in the event of hardware or software failure. Data availability was further enhanced by using hardware- and software-based disk striping (RAID-0), disk mirroring (RAID-1), and disk striping with parity (RAID-5).

37. Trace processing generates quality values for base calls by means of Paracel’s TraceTuner, trims sequence reads according to quality values, trims vector and adapter sequence from high-quality reads, and screens sequences for contaminants. Similar in design and algorithm to the phred program (174), TraceTuner reports quality values that reflect the log-odds score of each base being correct. Read quality was evaluated in 50-bp windows, each read being trimmed to include only those consecutive 50-bp segments with a minimum mean accuracy of 97%. End windows (both ends of the trace) of 1, 5, 10, 25, and 50 bases were trimmed to a minimum mean accuracy of 98%. Every read was further checked for vector and contaminant matches of 50 bp or more, and if found, the read was removed from consideration. Finally, any match to the 5′ vector splice junction in the initial part of a read was removed.
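The windowed trimming rule in note 37 can be sketched roughly as follows. This is an illustration only, not TraceTuner's implementation: the function names, the use of non-overlapping windows, the phred-style conversion from quality value to accuracy, and the choice to keep the longest passing run are all assumptions, and the separate end-window trimming at 98% is omitted.

```python
from typing import List, Tuple

def accuracy(q: int) -> float:
    """Phred-style quality value -> probability that the base call is correct."""
    return 1.0 - 10.0 ** (-q / 10.0)

def trim_by_windows(quals: List[int], window: int = 50,
                    min_mean_acc: float = 0.97) -> Tuple[int, int]:
    """Return (start, end) of the longest run of consecutive passing windows.

    Windows are non-overlapping (an assumption); a trailing partial window
    is ignored. (0, 0) means no window passes.
    """
    best = (0, 0)
    run_start = None
    n_windows = len(quals) // window
    # One extra iteration acts as a sentinel that closes a run at the read end.
    for w in range(n_windows + 1):
        passing = False
        if w < n_windows:
            chunk = quals[w * window:(w + 1) * window]
            passing = sum(accuracy(q) for q in chunk) / window >= min_mean_acc
        if passing and run_start is None:
            run_start = w * window
        elif not passing and run_start is not None:
            end = w * window
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
            run_start = None
    return best
```

For a read with low-quality ends and a high-quality middle, the function returns the coordinates of the middle segment.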

38. National Center for Biotechnology Information (NCBI); available at www.ncbi.nlm.nih.gov/.

39. NCBI; available at www.ncbi.nlm.nih.gov/HTGS/.

40. All bactigs over 3 kbp were examined for coverage by Celera mate pairs. An interval of a bactig was deemed an assembly error where there were no mate pairs spanning the interval and at least two reads that should have their mate on the other side of the interval but did not. In other words, there was no mate pair evidence supporting a join in the breakpoint interval and at least two mate pairs contradicting the join. By this criterion, we detected and broke apart bactigs at 13,037 locations, or equivalently, we found 2.13% of the bactigs to be misassembled.
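The breakpoint criterion in note 40 can be expressed as a small predicate. The sketch below is a hypothetical illustration: the `Read` record, its field names, and the idea of a precomputed expected mate position are assumptions made for the example, not details given in the note.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Read:
    pos: int                      # placed coordinate of this read on the bactig
    expected_mate: int            # where library geometry says the mate should lie
    observed_mate: Optional[int]  # where the mate was actually placed, if at all

def side(pos: int, start: int, end: int) -> int:
    """-1 if left of the interval, +1 if right of it, 0 if inside."""
    if pos < start:
        return -1
    if pos > end:
        return 1
    return 0

def is_breakpoint(reads: List[Read], start: int, end: int,
                  min_contradicting: int = 2) -> bool:
    """True if no mate pair spans [start, end] and enough reads contradict the join."""
    supporting = 0
    contradicting = 0
    for r in reads:
        here = side(r.pos, start, end)
        if here == 0:
            continue  # reads inside the interval say nothing about the join
        across = -here
        if r.observed_mate is not None and side(r.observed_mate, start, end) == across:
            supporting += 1      # mate found on the far side: supports the join
        elif side(r.expected_mate, start, end) == across:
            contradicting += 1   # mate expected on the far side but not there
    return supporting == 0 and contradicting >= min_contradicting
```

A single spanning mate pair is enough to keep the interval intact under this rule, which matches the note's requirement of no supporting evidence plus at least two contradictions.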

41. We considered a BAC entry to be chimeric if, by the Lander-Waterman statistic (175), the odds were 0.99 or more that the assembly we produced was inconsistent with the sequence coming from a single source. By this criterion, 714 or 2.2% of BAC entries were deemed chimeric.

42. G. Myers, S. Selznick, Z. Zhang, W. Miller, J. Comput. Biol. 3, 563 (1996).

43. E. W. Myers, J. L. Weber, in Computational Methods in Genome Research, S. Suhai, Ed. (Plenum, New York, 1996), pp. 73–89.

44. P. Deloukas et al., Science 282, 744 (1998).

45. M. A. Marra et al., Genome Res. 7, 1072 (1997).

46. J. Zhang et al., data not shown.

47. Shredded bactigs were located on long CSA scaffolds (>500 kbp) and the distribution of these fragments on the scaffolds was analyzed. If the spread of these fragments was greater than four times the reported BAC length, the BAC was considered to be chimeric. In addition, if >20% of bactigs of a given BAC were found on different scaffolds that were not adjacent in map position, then the BAC was also considered as chimeric. The total chimeric BACs divided by the number of BACs used for CSA gave the minimal estimate of chimerism rate.
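The two chimerism tests in note 47 can be sketched as below. Several details are assumptions made for illustration: the function and parameter names, treating the scaffold that holds the most bactigs as the BAC's "home" scaffold, measuring spread on that scaffold only, and passing map adjacency in as a precomputed set.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

def is_chimeric(placements: List[Tuple[str, int]], bac_length: int,
                adjacent: FrozenSet[str] = frozenset()) -> bool:
    """placements: (scaffold_id, position) for each shredded bactig of one BAC.

    adjacent: scaffolds adjacent in map position to the home scaffold.
    """
    if not placements:
        return False
    counts = Counter(scaffold for scaffold, _ in placements)
    home = counts.most_common(1)[0][0]  # assumed: scaffold with most bactigs
    positions = [pos for scaffold, pos in placements if scaffold == home]
    # Test 1: spread greater than four times the reported BAC length.
    if max(positions) - min(positions) > 4 * bac_length:
        return True
    # Test 2: more than 20% of bactigs on non-adjacent scaffolds.
    stray = sum(1 for scaffold, _ in placements
                if scaffold != home and scaffold not in adjacent)
    return stray / len(placements) > 0.20
```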

48. M. Hattori et al., Nature 405, 311 (2000).

49. I. Dunham et al., Nature 402, 489 (1999).

50. A. B. Carvalho, B. P. Lazzaro, A. G. Clark, Proc. Natl. Acad. Sci. U.S.A. 97, 13239 (2000).

51. The International RH Mapping Consortium, available at www.ncbi.nlm.nih.gov/genemap99/.

52. See http://ftp.genome.washington.edu/RM/RepeatMasker.html.

53. G. D. Schuler, Trends Biotechnol. 16, 456 (1998).

54. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, J. Mol. Biol. 215, 403 (1990).

55a. M. Olivier et al., Science 291, 1298 (2001).

55b. See http://genome.ucsc.edu/.

56. N. Chaudhari, W. E. Hahn, Science 220, 924 (1983); R. J. Milner, J. G. Sutcliffe, Nucleic Acids Res. 11, 5497 (1983).

57. D. Dickson, Nature 401, 311 (1999).

58. B. Ewing, P. Green, Nature Genet. 25, 232 (2000).

59. H. Roest Crollius et al., Nature Genet. 25, 235 (2000).

60. M. Yandell, in preparation.

61. K. D. Pruitt, K. S. Katz, H. Sicotte, D. R. Maglott, Trends Genet. 16, 44 (2000).

62. Scaffolds containing greater than 10 kbp of sequence were analyzed for features of biological importance through a series of computational steps, and the results were stored in a relational database. For scaffolds greater than one megabase, the sequence was cut into single megabase pieces before computational analysis. All sequence was masked for complex repeats using Repeatmasker (52) before gene finding or homology-based analysis. The computational pipeline required ~7 hours of CPU time per megabase, including repeat masking, or a total compute time of about 20,000 CPU hours. Protein searches were performed against the nonredundant protein database available at the NCBI. Nucleotide searches were performed against human, mouse, and rat Celera Gene Indices (assemblies of cDNA and EST sequences), mouse genomic DNA reads generated at Celera (33), the Ensembl gene database available at the European Bioinformatics Institute (EBI), human and rodent (mouse and rat) EST data sets parsed from the dbEST database (NCBI), and a curated subset of the RefSeq experimental mRNA database (NCBI). Initial searches were performed on repeat-masked sequence with BLAST 2.0 (54) optimized for the Compaq Alpha compute-server and an effective database size of $3 \times 10^{9}$ for BLASTN searches and $1 \times 10^{9}$ for BLASTX searches. Additional processing of each query-subject pair was performed to improve the alignments. All protein BLAST results having an expectation score of $<1 \times 10^{-4}$, human nucleotide BLAST results having an expectation score of $<1 \times 10^{-8}$ with >94% identity, and rodent nucleotide BLAST results having an expectation score of $<1 \times 10^{-8}$ with >80% identity were then examined on the basis of their high-scoring pair (HSP) coordinates on the scaffold to remove redundant hits, retaining hits that supported possible alternative splicing. For BLASTX searches, analysis was performed separately for selected model organisms (yeast, mouse, human, C. elegans, and D. melanogaster) so as not to exclude HSPs from these organisms that support the same gene structure. Sequences producing BLAST hits judged to be informative, nonredundant, and sufficiently similar to the scaffold sequence were then realigned to the genomic sequence with Sim4 for ESTs, and with Lap for proteins. Because both of these algorithms take splicing into account, the resulting alignments usually give a better representation of intron-exon boundaries than standard BLAST analyses and thus facilitate further annotation (both machine and human). In addition to the homology-based analysis described above, three ab initio gene prediction programs were used (63).
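The expectation-score and identity cutoffs in note 62 amount to a simple per-hit filter, sketched here. The `Hit` record, the category labels, and the function name are hypothetical; the rodent cutoff is taken to be 1e-8, matching the human nucleotide cutoff, which is an assumption. Redundancy removal by HSP coordinates is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (maximum E-value, minimum percent identity or None) per search type.
# The rodent E-value cutoff of 1e-8 is an assumed reading of note 62.
CUTOFFS: Dict[str, Tuple[float, Optional[float]]] = {
    "protein": (1e-4, None),
    "human_nt": (1e-8, 94.0),
    "rodent_nt": (1e-8, 80.0),
}

@dataclass
class Hit:
    kind: str        # one of the CUTOFFS keys (hypothetical labels)
    evalue: float    # BLAST expectation score
    identity: float  # percent identity of the alignment

def passes(hit: Hit) -> bool:
    """True if the hit is strictly below the E-value cutoff and above the identity floor."""
    max_e, min_identity = CUTOFFS[hit.kind]
    if hit.evalue >= max_e:
        return False
    return min_identity is None or hit.identity > min_identity
```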

63. E. C. Uberbacher, Y. Xu, R. J. Mural, Methods Enzymol. 266, 259 (1996); C. Burge, S. Karlin, J. Mol. Biol. 268, 78 (1997); R. J. Mural, Methods Enzymol. 303, 77 (1999); A. A. Salamov, V. V. Solovyev, Genome Res. 10, 516 (2000); L. Florea et al., Genome Res. 8, 967 (1998).

64. G. L. Miklos, B. John, Am. J. Hum. Genet. 31, 264 (1979); U. Francke, Cytogenet. Cell Genet. 65, 206 (1994).

65. P. E. Warburton, H. F. Willard, in Human Genome Evolution, M. S. Jackson, T. Strachan, G. Dover, Eds. (BIOS Scientific, Oxford, 1996), pp. 121–145.

66. J. E. Horvath, S. Schwartz, E. E. Eichler, Genome Res. 10, 839 (2000).

67. W. A. Bickmore, A. T. Sumner, Trends Genet. 5, 144 (1989).

68. G. P. Holmquist, Am. J. Hum. Genet. 51, 17 (1992).

69. G. Bernardi, Gene 241, 3 (2000).

70. S. Zoubak, O. Clay, G. Bernardi, Gene 174, 95 (1996).

71. S. Ohno, Trends Genet. 1, 160 (1985).

72. K. W. Broman, J. C. Murray, V. C. Sheffield, R. L. White, J. L. Weber, Am. J. Hum. Genet. 63, 861 (1998).

73. M. J. McEachern, A. Krauskopf, E. H. Blackburn, Annu. Rev. Genet. 34, 331 (2000).

74. A. Bird, Trends Genet. 3, 342 (1987).

75. M. Gardiner-Garden, M. Frommer, J. Mol. Biol. 196, 261 (1987).

76. F. Larsen, G. Gundersen, R. Lopez, H. Prydz, Genomics 13, 1095 (1992).

77. S. H. Cross, A. Bird, Curr. Opin. Genet. Dev. 5, 309 (1995).

78. J. Peters, Genome Biol. 1, reviews1028.1 (2000) (http://genomebiology.com/2000/1/5/reviews/1028).

79. C. Grunau, W. Hindermann, A. Rosenthal, Hum. Mol. Genet. 9, 2651 (2000).

80. F. Antequera, A. Bird, Proc. Natl. Acad. Sci. U.S.A. 90, 11995 (1993).

81. S. H. Cross et al., Mamm. Genome 11, 373 (2000).

82. D. Slavov et al., Gene 247, 215 (2000).

83. A. F. Smit, A. D. Riggs, Nucleic Acids Res. 23, 98 (1995).

84. D. J. Elliott et al., Hum. Mol. Genet. 9, 2117 (2000).

85. A. V. Makeyev, A. N. Chkheidze, S. A. Liebhaber, J. Biol. Chem. 274, 24849 (1999).

86. Y. Pan, W. K. Decker, A. H. H. M. Huq, W. J. Craigen, Genomics 59, 282 (1999).

87. P. Nouvel, Genetica 93, 191 (1994).

88. I. Goncalves, L. Duret, D. Mouchiroud, Genome Res. 10, 672 (2000).

89. Lek first compares all proteins in the proteome to one another. Next, the resulting BLAST reports are parsed, and a graph is created wherein each protein constitutes a node; any hit between two proteins with an expectation beneath a user-specified threshold constitutes an edge. Lek then uses this graph to compute a similarity between each protein pair ij in the context of the graph as a whole by simply dividing the number of BLAST hits shared in common between the two proteins by the total number of proteins hit by i and j. This simple metric has several interesting properties. First, because the similarity metric takes into account both the similarity and the differences between the two sequences at the level of BLAST hits, the metric respects the multidomain nature of protein space. Two multidomain proteins, for instance, each containing domains A and B, will have a greater pairwise similarity to each other than either one will have to a protein containing only A or B domains, so long as A-B–containing multidomain proteins are less frequent in the proteome than are single-domain proteins containing A or B domains. A second interesting property of this similarity metric is that it can be used to produce a similarity matrix for the proteome as a whole without having to first produce a multiple alignment for each protein family, an error-prone and very time-consuming process. Finally, the metric does not require that either sequence have significant homology to the other in order to have a defined similarity to each other, only that they

share at least one significant BLAST hit in common. This is an especially interesting property of the metric, because it allows the rapid recovery of protein families from the proteome for which no multiple alignment is possible, thus providing a computational basis for the extension of protein homology searches beyond those of current HMM- and profile-based search methods. Once the whole-proteome similarity matrix has been calculated, Lek first partitions the proteome into single-linkage clusters (27) on the basis of one or more shared BLAST hits between two sequences. Next, these single-linkage clusters are further partitioned into subclusters, each member of which shares a user-specified pairwise similarity with the other members of the cluster, as described above. For the purposes of this publication, we have focused on the analysis of single-linkage clusters and what we have termed “complete clusters,” i.e., those subclusters for which every member has a similarity metric of 1 to every other member of the subcluster. We believe that the single-linkage and complete clusters are of special interest, in part, because they allow us to estimate and to compare sizes of core protein sets in a rigorous manner. The rationale for this is as follows: if one imagines for a moment a perfect clustering algorithm capable of perfectly partitioning one or more perfectly annotated protein sets into protein families, it is reasonable to assume that the number of clusters will always be greater than, or equal to, the number of single-linkage clusters, because single-linkage clustering is a maximally agglomerative clustering method. Thus, if there exists a single protein in the predicted protein set containing domains A and B, then it will be clustered by single linkage together with all single-domain proteins containing domains A or B. Likewise, for a predicted protein set containing a single multidomain protein, the number of real clusters must always be less than or equal to the number of complete clusters, because it is impossible to place a unique multidomain protein into a complete cluster. Thus, the single-linkage and complete clusters plus singletons should comprise a lower and upper bound of sizes of core protein sets, respectively, allowing us to compare the relative size and complexity of different organisms’ predicted protein sets.
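The similarity metric and single-linkage step described in note 89 can be sketched as follows. This is an illustrative reading, not the Lek code: the function names are hypothetical, each protein's hit set is assumed to include the protein itself (as in an all-against-all BLAST), and the ratio is computed as shared hits over the union of both hit sets.

```python
from itertools import combinations
from typing import Dict, List, Set

def lek_similarity(hits: Dict[str, Set[str]], i: str, j: str) -> float:
    """Shared BLAST hits of i and j divided by all proteins hit by i or j."""
    union = hits[i] | hits[j]
    return len(hits[i] & hits[j]) / len(union) if union else 0.0

def single_linkage(hits: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group proteins that share at least one BLAST hit, transitively."""
    parent = {p: p for p in hits}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(hits, 2):
        if hits[i] & hits[j]:
            parent[find(i)] = find(j)
    clusters: Dict[str, Set[str]] = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())

# Toy proteome: AB1 and AB2 carry domains A and B; A1 and B1 carry one each.
toy = {
    "AB1": {"AB1", "AB2", "A1", "B1"},
    "AB2": {"AB1", "AB2", "A1", "B1"},
    "A1": {"A1", "AB1", "AB2"},
    "B1": {"B1", "AB1", "AB2"},
    "Z": {"Z"},
}
```

In the toy data, the two A-B proteins score 1.0 against each other and 0.75 against either single-domain protein, while A1 and B1 score 0.5 against each other without hitting one another directly; single linkage places all four in one cluster and leaves Z as a singleton.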

90. T. F. Smith, M. S. Waterman, J. Mol. Biol. 147, 195 (1981).

91. A. L. Delcher et al., Nucleic Acids Res. 27, 2369 (1999).

92. Arabidopsis Genome Initiative, Nature 408, 796 (2000).

93. The probability that a contiguous set of proteins is the result of a segmental duplication can be estimated approximately as follows. Given that proteins A and B occur on one chromosome, and that A′ and B′ (paralogs of A and B) also exist in the genome, the probability that B′ occurs immediately after A′ is $1/N$, where $N$ is the number of proteins in the set (for this analysis, $N = 26{,}588$). Allowing for B′ to occur as any of the next $J - 1$ proteins (leaving a gap between A′ and B′) increases the probability to $(J - 1)/N$; allowing B′A′ or A′B′ gives a probability of $2(J - 1)/N$. Considering three genes ABC, the probability of observing A′B′C′ elsewhere in the genome, given that the paralogs exist, is $1/N^{2}$. Three proteins can occur across a spread of five positions in six ways; more generally, we compute the number of ways that $K$ proteins can be spread across $J$ positions by counting all possible arrangements of $K - 2$ proteins in the $J - 2$ positions between the first and last protein. Allowing for a spread to vary from $K$ positions (no gaps) to $J$ gives

$$L = \sum_{X=K-2}^{J-2} \binom{X}{K-2}$$

arrangements. Thus, the probability of chance occurrence is $L/N^{K-1}$. Allowing for both sets of genes (e.g., ABC and A′B′C′) to be spread across $J$ positions increases this to $L^{2}/N^{K-1}$. The duplicated segment might be rearranged by the operations of reversal or translocation; allowing for $M$ such rearrangements gives us a probability $P = L^{2}M/N^{K-1}$. For example, the probability of observing a duplicated set of three genes in two different locations, where the three genes occur across a spread of five positions in both locations, is $36/N^{2}$; the expected number of such matched sets in the predicted protein set is approximately $(N)36/N^{2} = 36/N$, a value $\ll 1$. Therefore, any such duplications of three genes are unlikely to result from random rearrangements of the genome. If any of the genes occur in more than two copies, the probability that the apparent duplication has occurred by chance increases. The algorithm for selecting candidate duplications only generates matched protein sets with $P \ll 1$.
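As a numerical check of the estimate in note 93, the sketch below evaluates the number of arrangements L and the chance probability P for the worked example (K = 3, J = 5, N = 26,588). The function names are illustrative, and M defaults to 1 as an assumption for the example; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def arrangements(k: int, j: int) -> int:
    """L: ways to spread k proteins over a span of between k and j positions."""
    return sum(comb(x, k - 2) for x in range(k - 2, j - 1))

def chance_probability(k: int, j: int, n: int, m: int = 1) -> float:
    """P = L^2 * M / N^(K-1) for a duplicated set of k genes."""
    return arrangements(k, j) ** 2 * m / n ** (k - 1)

N = 26588
L = arrangements(3, 5)                     # six ways, as stated in the note
expected = N * chance_probability(3, 5, N)  # expected matched sets, 36/N
```

With K = 2 the count reduces to J − 1, which agrees with the two-gene case in the note, and the expected number of three-gene matches comes out near 0.0014, far below 1.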

94. B. J. Trask et al., Hum. Mol. Genet. 7, 13 (1998); D. Sharon et al., Genomics 61, 24 (1999).

95. W. B. Barbazuk et al., Genome Res. 10, 1351 (2000); A. McLysaght, A. J. Enright, L. Skrabanek, K. H. Wolfe, Yeast 17, 22 (2000); D. W. Burt et al., Nature 402, 411 (1999).

96. Reviewed in L. Skrabanek, K. H. Wolfe, Curr. Opin. Genet. Dev. 8, 694 (1998).

97. P. Taillon-Miller, Z. Gu, Q. Li, L. Hillier, P. Y. Kwok, Genome Res. 8, 748 (1998); P. Taillon-Miller, E. E. Piernot, P. Y. Kwok, Genome Res. 9, 499 (1999).

98. D. Altshuler et al., Nature 407, 513 (2000).

99. G. T. Marth et al., Nature Genet. 23, 452 (1999).

100. W.-H. Li, Molecular Evolution (Sinauer, Sunderland, MA, 1997).

101. M. Cargill et al., Nature Genet. 22, 231 (1999).

102. M. K. Halushka et al., Nature Genet. 22, 239 (1999).

103. J. Zhang, T. L. Madden, Genome Res. 7, 649 (1997).

104. M. Nei, Molecular Evolutionary Genetics (Columbia Univ. Press, New York, 1987).

105. From the observed coverage of the sequences at each site for each individual, we calculated the probability that a SNP would be detected at the site if it were present. For each level of coverage, there is a binomial sampling of the two homologs for each individual, and a heterozygous site could only be ascertained if both homologs are present, or if two alleles from different individuals are present. With coverage $x$ from a given individual, both homologs are present in the assembly with probability $1 - (1/2)^{x-1}$. Even if both homologs are present, the probability that a SNP is detected is <1 because a fraction of sites failed the quality criteria. Integrating over coverage levels, the binomial sampling, and the quality distribution, we derived an expected number of sites in the genome that were ascertained for polymorphism for each individual. The nucleotide diversity was then the observed number of variable sites divided by the expected number of sites ascertained.
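The coverage term in note 105 follows from the fact that x reads all come from the same homolog with probability 2 × (1/2)^x = (1/2)^(x−1). The sketch below compares the closed form with brute-force enumeration of every assignment of reads to homologs; the function names are illustrative, and each read is assumed to sample either homolog with equal probability.

```python
from itertools import product

def p_both_homologs(x: int) -> float:
    """Closed form: probability that x reads cover both homologs."""
    if x < 1:
        raise ValueError("coverage must be at least 1")
    return 1.0 - 0.5 ** (x - 1)

def p_both_by_enumeration(x: int) -> float:
    """Same probability by listing all 2^x equally likely read assignments."""
    outcomes = list(product("AB", repeat=x))
    both = sum(1 for o in outcomes if "A" in o and "B" in o)
    return both / len(outcomes)
```

At coverage 1 the probability is 0, at coverage 2 it is 0.5, and it approaches 1 quickly as coverage grows, which is why low-coverage sites contribute little to the expected number of ascertained sites.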

106. M. W. Nachman, V. L. Bauer, S. L. Crowell, C. F. Aquadro, Genetics 150, 1133 (1998).

107. D. A. Nickerson et al., Nature Genet. 19, 233 (1998); D. A. Nickerson et al., Genome Res. 10, 1532 (2000); L. Jorde et al., Am. J. Hum. Genet. 66, 979 (2000); D. G. Wang et al., Science 280, 1077 (1998).

108. M. Przeworski, R. R. Hudson, A. Di Rienzo, Trends Genet. 16, 296 (2000).

109. S. Tavare, Theor. Popul. Biol. 26, 119 (1984).

110. R. R. Hudson, in Oxford Surveys in Evolutionary Biology, D. J. Futuyma, J. D. Antonovics, Eds. (Oxford Univ. Press, Oxford, 1990), vol. 7, pp. 1–44.

111. A. G. Clark et al., Am. J. Hum. Genet. 63, 595 (1998).

112. M. Kimura, The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, Cambridge, 1983).

113. H. Kaessmann, F. Heissig, A. von Haeseler, S. Paabo, Nature Genet. 22, 78 (1999).

114. E. L. Sonnhammer, S. R. Eddy, R. Durbin, Proteins 28, 405 (1997).

115. A. Bateman et al., Nucleic Acids Res. 28, 263 (2000).

116. Brief description of the methods used to build the Panther classification. First, the June 2000 release of the GenBank NR protein database (excluding sequences annotated as fragments or mutants) was partitioned into clusters using BLASTP. For the clustering, a seed sequence was randomly chosen, and the cluster was defined as all sequences matching the seed to statistical significance (E-value < $10^{-5}$) and “globally” alignable (the length of the match region must be >70% and <130% of the length of the seed). If the cluster had more than five members, and at least one from a multicellular eukaryote, the cluster was extended. For the extension step, a hidden Markov Model (HMM) was trained for the cluster, using the SAM software package, version 2. The HMM was then scored against GenBank NR (excluding mutants but including fragments for this step), and all sequences scoring better than a specific (NLL-NULL) score were added to the cluster. The HMM was then retrained (with fixed model length) and all sequences in the cluster were aligned to the HMM to produce a multiple sequence alignment. This alignment was assessed by a number of quality measures. If the alignment failed the quality check, the initial cluster was rebuilt around the seed using a more restrictive E-value, followed by extension, alignment, and reassessment. This process was repeated until the alignment quality was good. The multiple alignment and “general” (i.e., describing the entire cluster, or “family”) HMM (176) were then used as input into the BETE program (177). BETE calculates a phylogenetic tree for the sequences in the alignment. Functional information about the sequences in each cluster was parsed from SwissProt (178) and GenBank records. “Tree-attribute viewer” software was used by biologist curators to correlate the phylogenetic tree with protein function. Subfamilies were manually defined on the basis of shared function across subtrees, and were named accordingly. HMMs were then built for each subfamily, using information from both the subfamily and family (K. Sjolander, in preparation). Families were also manually named according to the functions contained within them. Finally, all of the families and subfamilies were classified into categories and subcategories based on their molecular functions. The categorization was done by manual review of the family and subfamily names, by examining SwissProt and GenBank records, and by review of the literature as well as resources on the World Wide Web. The current version (2.0) of the Panther molecular function schema has four levels: category, subcategory, family, and subfamily. Protein sequences for whole eukaryotic genomes (for the predicted human proteins and annotated proteins for fly, worm, yeast, and Arabidopsis) were scored against the Panther library of family and subfamily HMMs. If the score was significant (the NLL-NULL score cutoff depends on the protein family), the protein was assigned to the family or subfamily function with the most significant score.
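The seed-clustering and extension criteria at the start of note 116 reduce to two small predicates, sketched here. The function and parameter names are hypothetical, and only the stated thresholds are modeled; HMM training, scoring, and the alignment quality checks are outside the scope of this sketch.

```python
def joins_seed_cluster(evalue: float, match_len: int, seed_len: int) -> bool:
    """Significant match (E-value < 1e-5) that is also 'globally' alignable.

    Globally alignable means the match region is more than 70% and less
    than 130% of the seed length.
    """
    significant = evalue < 1e-5
    globally_alignable = 0.70 * seed_len < match_len < 1.30 * seed_len
    return significant and globally_alignable

def should_extend(n_members: int, has_multicellular_eukaryote: bool) -> bool:
    """Extend only clusters with more than five members, one from a multicellular eukaryote."""
    return n_members > 5 and has_multicellular_eukaryote
```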

117. C. P. Ponting, J. Schultz, F. Milpetz, P. Bork, Nucleic Acids Res. 27, 229 (1999).

118. A. Goffeau et al., Science 274, 546, 563 (1996).

119. C. elegans Sequencing Consortium, Science 282, 2012 (1998).

120. S. A. Chervitz et al., Science 282, 2022 (1998).

121. E. R. Kandel, J. H. Schwartz, T. Jessell, Principles of Neural Science (McGraw-Hill, New York, ed. 4, 2000).

122. D. A. Goodenough, J. A. Goliger, D. L. Paul, Annu. Rev. Biochem. 65, 475 (1996).

123. D. G. Wilkinson, Int. Rev. Cytol. 196, 177 (2000).

124. F. Nakamura, R. G. Kalb, S. M. Strittmatter, J. Neurobiol. 44, 219 (2000).

125. P. J. Horner, F. H. Gage, Nature 407, 963 (2000); P. Casaccia-Bonnefil, C. Gu, M. V. Chao, Adv. Exp. Med. Biol. 468, 275 (1999).

126. S. Wang, B. A. Barres, Neuron 27, 197 (2000).

127. M. Geppert, T. C. Sudhof, Annu. Rev. Neurosci. 21, 75 (1998); J. T. Littleton, H. J. Bellen, Trends Neurosci. 18, 177 (1995).

128. A. Maximov, T. C. Sudhof, I. Bezprozvanny, J. Biol. Chem. 274, 24453 (1999).

129. B. Sampo et al., Proc. Natl. Acad. Sci. U.S.A. 97, 3666 (2000).

130. G. Lemke, Glia 7, 263 (1993).

131. M. Bernfield et al., Annu. Rev. Biochem. 68, 729 (1999).

132. N. Perrimon, M. Bernfield, Nature 404, 725 (2000).

133. U. Lindahl, M. Kusche-Gullberg, L. Kjellen, J. Biol. Chem. 273, 24979 (1998).

134. J. L. Riechmann et al., Science 290, 2105 (2000).

135. T. L. Hurskainen, S. Hirohata, M. F. Seldin, S. S. Apte, J. Biol. Chem. 274, 25555 (1999).

THE HUMAN GENOME

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org 1350


136. R. A. Black, J. M. White, Curr. Opin. Cell Biol. 10, 654 (1998).

137. L. Aravind, V. M. Dixit, E. V. Koonin, Trends Biochem. Sci. 24, 47 (1999).

138. A. G. Uren et al., Mol. Cell 6, 961 (2000).

139. P. Garcia-Meunier, M. Etienne-Julan, P. Fort, M. Piechaczyk, F. Bonhomme, Mamm. Genome 4, 695 (1993).

140. K. Meyer-Siegler et al., Proc. Natl. Acad. Sci. U.S.A. 88, 8460 (1991).

141. N. R. Mansur, K. Meyer-Siegler, J. C. Wurzer, M. A. Sirover, Nucleic Acids Res. 21, 993 (1993).

142. N. A. Tatton, Exp. Neurol. 166, 29 (2000).

143. N. Kenmochi et al., Genome Res. 8, 509 (1998).

144. F. W. Chen, Y. A. Ioannou, Int. Rev. Immunol. 18, 429 (1999).

145. H. O. Madsen, K. Poulsen, O. Dahl, B. F. Clark, J. P. Hjorth, Nucleic Acids Res. 18, 1513 (1990).

146. D. M. Chambers, J. Peters, C. M. Abbott, Proc. Natl. Acad. Sci. U.S.A. 95, 4463 (1998); A. Khalyfa, B. M. Carlson, J. A. Carlson, E. Wang, Dev. Dyn. 216, 267 (1999).

147. D. Aeschlimann, V. Thomazy, Connect. Tissue Res. 41, 1 (2000).

148. P. Munroe et al., Nature Genet. 21, 142 (1999); S. M. Wu, W. F. Cheung, D. Frazier, D. W. Stafford, Science 254, 1634 (1991); B. Furie et al., Blood 93, 1798 (1999).

149. J. W. Kehoe, C. R. Bertozzi, Chem. Biol. 7, R57 (2000).

150. T. Pawson, P. Nash, Genes Dev. 14, 1027 (2000).

151. A. W. van der Velden, A. A. Thomas, Int. J. Biochem. Cell Biol. 31, 87 (1999).

152. C. M. Fraser et al., Science 281, 375 (1998); H. Tettelin et al., Science 287, 1809 (2000).

153. D. Brett et al., FEBS Lett. 474, 83 (2000).

154. H. J. Muller, H. Kern, Z. Naturforsch. B 22, 1330 (1967).

155. H. J. Muller, in Heritage from Mendel, R. A. Brink, Ed. (Univ. of Wisconsin Press, Madison, WI, 1967), p. 419.

156. J. F. Crow, M. Kimura, Introduction to Population Genetics Theory (Harper & Row, New York, 1970).

157. K. Kobayashi et al., Nature 394, 388 (1998).

158. A. P. Feinberg, Curr. Top. Microbiol. Immunol. 249, 87 (2000).

159. C. A. Collins, C. Guthrie, Nature Struct. Biol. 7, 850 (2000).

160. S. R. Eddy, Curr. Opin. Genet. Dev. 9, 695 (1999).

161. Q. Wang, J. Khillan, P. Gadue, K. Nishikura, Science 290, 1765 (2000).

162. M. Holcik, N. Sonenberg, R. G. Korneluk, Trends Genet. 16, 469 (2000).

163. T. A. McKinsey, C. L. Zhang, J. Lu, E. N. Olson, Nature 408, 106 (2000).

164. E. Capanna, M. G. M. Romanini, Caryologia 24, 471 (1971).

165. J. Maynard Smith, J. Theor. Biol. 128, 247 (1987).

166. D. Charlesworth, B. Charlesworth, M. T. Morgan, Genetics 141, 1619 (1995).

167. J. E. Bailey, Nature Biotechnol. 17, 616 (1999).

168. R. Maleszka, H. G. de Couet, G. L. Miklos, Proc. Natl. Acad. Sci. U.S.A. 95, 3731 (1998).

169. G. L. Miklos, J. Neurobiol. 24, 842 (1993).

170. J. P. Crutchfield, K. Young, Phys. Rev. Lett. 63, 105 (1989); M. Gell-Mann, S. Lloyd, Complexity 2, 44 (1996).

171. A. L. Barabasi, R. Albert, Science 286, 509 (1999).

172. E. Colucci-Guyon et al., Cell 79, 679 (1994).

173. J. Sambrook, E. F. Fritch, T. Maniatis, Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, ed. 2, 1989).

174. B. Ewing, P. Green, Genome Res. 8, 186 (1998); B. Ewing, L. Hillier, M. C. Wendl, P. Green, Genome Res. 8, 175 (1998).

175. E. S. Lander, M. S. Waterman, Genomics 2, 231 (1988).

176. A. Krogh, K. Sjolander, J. Mol. Biol. 235, 1501 (1994).

177. K. Sjolander, Proc. Int. Soc. Mol. Biol. 6, 165 (1998).

178. A. Bairoch, R. Apweiler, Nucleic Acids Res. 28, 45 (2000).

179. GO, available at www.geneontology.org/.

180. R. L. Tatusov, M. Y. Galperin, D. A. Natale, E. V. Koonin, Nucleic Acids Res. 28, 33 (2000).

181. We thank E. Eichler and J. L. Goldstein for many helpful discussions and critical reading of the manuscript, and A. Caplan for advice and encouragement. We also thank T. Hein, D. Lucas, G. Edwards, and the Celera IT staff for outstanding computational support. The cost of this project was underwritten by the Celera Genomics Group of the Applera Corporation. We thank the Board of Directors of Applera Corporation: J. F. Abely Jr. (retired), R. H. Ayers, J.-L. Belingard, R. H. Hayes, A. J. Levine, T. E. Martin, C. W. Slayman, O. R. Smith, G. C. St. Laurent Jr., and J. R. Tobin for their vision, enthusiasm, and unwavering support and T. L. White for leadership and advice. Data availability: The genome sequence and additional supporting information are available to academic scientists at the Web site (www.celera.com). Instructions for obtaining a DVD of the genome sequence can be obtained through the Web site. For commercial scientists wishing to verify the results presented here, the genome data are available upon signing a Material Transfer Agreement, which can also be found on the Web site.

5 December 2000; accepted 19 January 2001


www.sciencemag.org SCIENCE Erratum post date 8 JUNE 2001

ERRATUM

CORRECTIONS AND CLARIFICATIONS

REPORTS: “The sequence of the human genome” by J. C. Venter et al. (16 Feb. 2001, p. 1304). In Table 10, the last column under the heading “Gene prediction” should have read “Total (Otto + de novo/2×).” This section of the table with the corrected column heading is shown here. The asterisk indicates that the chromosomal assignment is unknown.

In the References and Notes section, the authors for reference 176 should have read “A. Krogh et al.”; the journal name in reference 177 should have been “Proc. Intell. Syst. Mol. Biol.”; and in note 181, the acknowledgement list should have included after G. Edwards the names L. Foster, D. Bhandari, P. Davies, T. Safford, and J. Schira.

Gene prediction*

Otto      De novo/any   De novo/2×   Total (Otto + de novo/any)   Total (Otto + de novo/2×)
1,743     1,710         710          3,453                        2,453
1,183     1,771         633          2,954                        1,816
1,013     1,414         598          2,427                        1,611
696       1,165         449          1,861                        1,145
892       1,244         474          2,136                        1,366
943       1,314         524          2,257                        1,467
759       1,072         460          1,831                        1,219
583       977           357          1,560                        940
689       848           329          1,537                        1,018
685       968           342          1,653                        1,027
1,051     1,134         535          2,185                        1,586
925       936           417          1,861                        1,342
341       691           241          1,032                        582
583       700           290          1,283                        873
558       640           246          1,198                        804
748       673           247          1,421                        995
897       648           313          1,545                        1,210
283       543           189          826                          472
1,141     534           268          1,675                        1,409
517       469           180          986                          697
184       265           102          449                          286
494       341           147          835                          641
605       860           387          1,465                        992
55        155           49           210                          104
196       278           132          474                          328
17,764    21,350        8,619        39,114                       26,383
714       812           333          1,526                        1,047
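The corrected column heading, “Total (Otto + de novo/2×),” can be checked against the numbers in the table section. The Python sketch below is an illustrative check only: it transcribes the first 25 numeric rows and the 17,764 row from the table, with variable and function names that are hypothetical. Treating the 17,764 row as a column total is an inference from the arithmetic, since row labels are not present in this section, and the final row (714, 812, ...) is left out because its meaning is not stated here.

```python
# Columns: Otto, de novo/any, de novo/2x,
#          Total (Otto + de novo/any), Total (Otto + de novo/2x)
ROWS = [
    (1743, 1710, 710, 3453, 2453), (1183, 1771, 633, 2954, 1816),
    (1013, 1414, 598, 2427, 1611), (696, 1165, 449, 1861, 1145),
    (892, 1244, 474, 2136, 1366), (943, 1314, 524, 2257, 1467),
    (759, 1072, 460, 1831, 1219), (583, 977, 357, 1560, 940),
    (689, 848, 329, 1537, 1018), (685, 968, 342, 1653, 1027),
    (1051, 1134, 535, 2185, 1586), (925, 936, 417, 1861, 1342),
    (341, 691, 241, 1032, 582), (583, 700, 290, 1283, 873),
    (558, 640, 246, 1198, 804), (748, 673, 247, 1421, 995),
    (897, 648, 313, 1545, 1210), (283, 543, 189, 826, 472),
    (1141, 534, 268, 1675, 1409), (517, 469, 180, 986, 697),
    (184, 265, 102, 449, 286), (494, 341, 147, 835, 641),
    (605, 860, 387, 1465, 992), (55, 155, 49, 210, 104),
    (196, 278, 132, 474, 328),
]
# Assumed to be the column totals (inferred from the arithmetic, not from a label).
TOTAL_ROW = (17764, 21350, 8619, 39114, 26383)


def row_is_consistent(row):
    """Check that both total columns equal Otto plus the matching de novo column."""
    otto, de_novo_any, de_novo_2x, total_any, total_2x = row
    return otto + de_novo_any == total_any and otto + de_novo_2x == total_2x


all_rows_ok = all(row_is_consistent(r) for r in ROWS + [TOTAL_ROW])
column_sums = tuple(sum(col) for col in zip(*ROWS))

print("every row consistent with the headings:", all_rows_ok)
print("column sums match the 17,764 row:", column_sums == TOTAL_ROW)
```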
