Review genome of Caenorhabditis elegans · theentire C. elegans researchcommunity. The communal approach is important in two ways. Fromthe point ofview of the map,it ensuresthatthespecializedknowl-edge

Proc. Natl. Acad. Sci. USAVol. 92, pp. 10836-10840, November 1995

Review

The genome of Caenorhabditis elegansR. Waterston* and J. Sulstont*Department of Genetics and Genome Sequencing Center, Washington University School of Medicine, St. Louis, MO 63110; and tSanger Centre, Hinxton Hall,Cambs CB10 1RQ, United Kingdom

ABSTRACT The physical map of the100-Mb Caenorhabditis elegans genomeconsists of 17,500 cosmids and 3500 yeastartificial chromosomes (YACs). A total of22.5 Mb has been sequenced, with theremainder expected by 1998. A further15.5 Mb of unfinished sequence is freelyavailable online: because the areas se-quenced so far are relatively gene rich,about half the 13,000 genes can now bescanned. More than a quarter of the genesare represented by expressed sequencetags (ESTs). All information pertainingto the genome is publicly available in theACeDB data base.

Since its inception in the early 1980s, theCaenorhabditis elegans genome project haspioneered approaches to physical map-ping and genomic sequencing. Today thephysical map of the 100-Mb genome isamong the largest and most complete yetconstructed. More genomic sequence andmore sequenced genes are now availablefrom the worm than from any other or-ganism, and we are on target for comple-tion of the full genomic sequence by theend of 1998. The utility of these resourcescan be judged by the many C. eleganslaboratories now using the map and se-quence to study mutationally definedgenes and by the use of sequence homol-ogies by many more laboratories. Thegenome sequence is critical to gaining athorough understanding of this importantmodel organism and will aid in studies ofhuman disease.

In this review, we shall attempt to de-scribe the underlying philosophy and thegeneral approaches that we feel have beenimportant for the success of the project.These points are applicable not only toother small genome projects but also tothe much larger and more challenginghuman genome project.

The Genome Map

The physical map consists largely of over-lapping cosmid and yeast artificial chro-mosome (YAC) clones (1-3). Both com-ponents are essential: the YACs, by virtueof their large inserts and propagation inyeast, provide long-range continuity andcan hold DNA that is unclonable in bac-terial cosmid clones, while the cosmidsprovide high resolution locally and a more

convenient substrate for biochemistry.Regions that are unclonable in cosmidsare often rich in repetitive sequences andrelatively poor in genes.The principal techniques for physical

mapping have been restriction enzyme-based fingerprinting for construction ofcosmid contigs; YAC-cosmid hybridiza-tions to gridded arrays; sequence-taggedsite (STS) assays for direct detection ofYAC-YAC overlaps; and hybridization toC. elegans chromosomes for long-rangeordering (20). The topological constraintsimposed by the ordered cosmids wereimportant for the interpretation of YAC-cosmid hybridizations in that they helpedto distinguish genuine matches from spu-rious matches due to repetitive sequences.

This physical array of cloned DNAs ismade into a genome map by the wealth ofgenetic markers that have been attachedto specific clones. This was facilitated bythe early and unrestricted distribution ofthe clone resources and by the researchcommunity's readiness to share informa-tion prior to publication. In fact, the ge-netic matches were also critical mappingtools in that they, along with in situ hy-bridization, provide the longest rangelinkage. Unlike the physical mapping,which was carried out mainly by two cen-tral laboratories, the genetic linkage hasbeen achieved by the cooperative effort ofthe entire C. elegans research community.The communal approach is important intwo ways. From the point of view of themap, it ensures that the specialized knowl-edge of all individuals and groups isbrought to bear on the project. From thepoint of view of effort and funding, itmeans that everyone is involved and al-lows the central resources to be as leanand focused as possible.The map as a whole gains from being a

multilevel construct; no single techniqueis sufficient by itself to provide full link-age, and strength arises from partial re-dundancy between the levels. This is im-portant because all mapping informationis to some degree stochastic.The fingerprinting method is readily

scalable and is being applied increasinglyto the human genome. Since the originalnematode work, the procedure has beenautomated, and new generations of assem-bly software are appearing. In contrast tothe worm, long-range order in the humangenome is being achieved at an earlier

10836

stage by STS analysis of YACs and radi-ation hybrids and by in situ hybridization.However, just as in the worm, bacterialclones [whether cosmids, Pls, P1 artificialchromosomes (PACs), or bacterial artifi-cial chromosomes (BACs) (21, 22)] willprovide the preferred substrates for bio-chemistry.

Genome Sequence

In sequencing the genome, we have as faras possible adopted the same philosophyof collective endeavor. The task of the twocentral laboratories is restricted to collect-ing the data as efficiently as possible. Theystrictly refrain from exploiting the se-quence for their own research purposesbefore its release. As soon as the raw datahas been assembled into contigs, it isavailable for screening by anyone [by be-ing placed in an anonymous file transferprotocol (ftp) server]. When each cosmidsequence is finished, it is analyzed bycomputer with supplemental human inter-pretation; the annotated sequence is thenimmediately submitted to GenBank orEMBL, as well as being placed in the C.elegans data base ACeDB (4). In this way,the expertise not only of the worm com-munity but of the whole world is broughtto bear at the earliest possible stage.

Sequencing at present is concentratedon the central regions of the autosomesand the whole of the X chromosome to-taling roughly 60% of the genome. Ge-netic and cDNA mapping data indicatethat these areas contain the majority ofthe genes (perhaps as many as 90% of thetotal) (refs. 5 and 6; Y. Kohara, personalcommunication), and, therefore, theproject will deliver high biological value asrapidly as possible. However, this does notweaken the plan to sequence the wholegenome; there are many important as-pects beyond the acquisition of proteincoding sequence.

Starting in this way has the added ben-efit for us that we are initially sequencingcosmids. Later, we shall have to deal withthe "YAC bridges"-i.e., the regions thatare cloned in YACs but not in cosmids-and we are experimenting with ways ofdoing this. Starting from whole YACs,

Abbreviations: YAC, yeast artificial chromo-some; STS, sequence-tagged site.

Dow

nloa

ded

by g

uest

on

June

30,

202

0

Proc. Natl. Acad. Sci. USA 92 (1995) 10837

though successfully done (7), is difficult atpresent because of the limited amountsand purity of material and the greaterproblems from repeated sequences. How-ever, with improved software and theavailability of the yeast sequence, thismethod may become more practical.Smaller bridges have been successfullyrescued by recombination from YACs inyeast, others by long-range PCR, andmany regions are susceptible to cloning inA vectors. During the next year or so, weshall continue to experiment with theseand other methods to determine the mostefficient approach.Our principal sequencing strategy is an

initial shotgun followed by directed fin-ishing (described in detail in ref. 8). It isworth emphasizing here that the shotgunsequences provide extremely detailed mapinformation, albeit only from a fraction ofthe subclone insert length. This map in-formation allows the relative positions ofthe random subclones to be established,while at the same time producing the bulkof the sequence information. Like all map-ping methods, it is vulnerable to repeatedsequences. The density and accuracy ofsequence information, however, compen-sate for the relatively short length of in-dividual reads. Improved assembly pro-grams (see below) are taking greater ad-vantage of this information, and thesequence itself provides powerful meansof evaluating alternative maps. In uncer-tain areas, additional sequence, for exam-ple from the opposite end of the insert,can provide additional map information.In addition, restriction enzyme digests ofthe parent clone provide a simple anddirect means of testing overall map accu-racy.

Cost and accuracy are key consider-ations in evaluating the effectiveness ofany strategy. Current direct and indirectcosts for the production of the final an-notated sequence are approaching $0.45 abase, and total costs of all activities (in-cluding development and related researchefforts) are below $0.70 a base. In general,the accuracy of the sequence appears to bebetter than 99.99% judging from compar-isons with previously sequenced genes,although these estimates are of limitedreliability in the absence of a truly inde-pendent means of checking the sequence.Throughput and the ability to scale the

effort are equally important. The com-bined sequencing rate of the two labora-tories should exceed 20 Mb for the currentcalendar year. This is set to double overthe next 18 months, leading to completionin 1996 of the gene-rich regions and in1997 of the entire set of available cos-mids-i.e., 80% of the genome. Thethroughput is likely to drop during 1997and 1998, as the remaining gaps are tack-led. Realistically, a small mopping up op-eration may be needed in 1998 and 1999,as the last 1-2% of the genome (involving

highly repetitive sequences) is mapped outand representative data are obtained.

It must be made clear that our objectiveis to extract the information from thegenome rather than to exhaustively se-quence every last base. For example, evennow we sometimes describe long tandemrepeats in terms of the consensus andnumber of copies. The occurrence of suchinstances is likely to be more frequent aswe move into repetitive regions, andtherefore, this method of reporting willincrease. Conversely, we sequence allother regions as accurately as possible.The shotgun/directed approach can be

applied equally well to the human ge-nome, provided that the extensive repeatfamilies are allowed for in the assemblyalgorithm. At first, we adapted R. Staden's(23) XGAP by screening the input so thatALU sequences were excluded from theinitial assembly process. We now beginwith P. Green's (University of Washing-ton, Seattle) PHRAP, which makes positivebut selective use of repeat sequences inassembly, and then feed the results toXGAP or other editing programs. Given thelower gene density of the human, it isworthwhile to cut costs by deliberatelycurtailing the finishing process. Many ofthe problem areas that take the most timeto complete are not of immediate biolog-ical interest, and therefore time andmoney can be saved by accurately map-ping around them. Should a particulararea prove of importance in the future, itcan be retrieved by PCR or as a restrictionfragment for further investigation.

Status and Applications

The current status of the sequencing ef-fort is summarized in Table 1. All of thesequence has been subjected to a series ofprograms to provide an initial interpreta-

tion of its features. Comparison with ex-pressed sequence tags (ESTs; 11,522 atpresent from more than 3000 genes) (ref.6; Y. Kohara, personal communication)allows the construction of a confirmedtranscription map for a quarter of thepredicted genes.We have also used the EST information

to estimate the number of genes in theentire genome. This can be done by divid-ing the number of predicted genes in agiven sequenced region (1131) by the frac-tion of the cDNA library for which exactmatches have been found in that region(1000/11522 = 0.087). It makes no differ-ence to the calculation that only about aquarter of the genes are represented in thecDNA library. The result, 13,500 genes(±500 at the 95% confidence level), restson the assumption that the expression ofgenes in the sequenced region is typical ofthe genome as a whole; this is likely to becloser to the truth than assuming a uni-form gene density.The genome map is being strengthened

by several other systematic studies. Theseprojects are entirely independent, buttheir findings are united through the map,and some of them draw on its resources.At the University of Leeds, England, IanHope is collecting expression data by us-ing transgenic reporter constructs of thepredicted genes (refs. 10 and 11; I. Hope,personal communication). Targeted genedisruption by transposon insertion andexcision was pioneered by Ronald Plasterk(Netherlands Cancer Institute, Amster-dam) and is now carried out in a numberof laboratories; in this way, functionalityfor the predicted genes can be determined(12). At the National Genetics Institute(Mishima, Japan), Yuji Kohara is contin-ually adding to his set of sequence-taggedcDNAs and is determining their expres-sion patterns by in situ hybridization. In

Table 1. Current state of C. elegans 100-Mb genome sequencing project

Physical map 17,500 cosmids3,500 YACsFive autosomes: total of eight gaps (all in gene poor regions)X chromosome: single contig of =18 Mb

DNA sequence 22.5 Mb completed as of 9/11/95Chromosome II: 6.7 MbChromosome III: 7.2 MbChromosome X: 8.3 MbChromosome IV: 0.3 Mb

4255 putative protein coding genes (=1 per each 5 kb);approximately 45% have significant similarity to non-C. elegans genes.

As well as having access to finished sequences in the GenBank, EMBL, and DDBJ data bases,investigators can search these sequence data and also more preliminary unfinished sequences atthe two genome centers. Currently, the searchable data base contains 38 Mb of sequence data(22.5 Mb finished, 15.5 Mb unfinished) which is estimated to contain about half of the genes inC. elegans. Searches with a nucleotide or protein query use the BLAST programs (9) and aresubmitted via a World Wide Web interface (see URL http://www.sanger.ac.uk/). This servicewill soon be available at St Louis (http://genome.wustl.edu/). The Web pages provide additionalinformation about the genome and include help addresses. All data pertaining to the genome(including genetic and physical maps and the sequence) are combined in the data base ACeDB.The latest release can be obtained by anonymous ftp from the following: U.S.A., ncbi.nlm.nih.gov(130.14.20.1) in repository/acedb; U.K., ftp.sanger.ac.uk (193.60.84.11) in pub/acedb; or France,lirmm.lirmm.fr (193.49.104.10) in genome/acedb.

Review: Waterston and Sulston

Dow

nloa

ded

by g

uest

on

June

30,

202

0

10838 Review: Waterston and Sulston

Vancouver, Canada, David Baillie (SimonFraser University) and Ann Rose (Uni-versity of British Columbia) are generat-ing transgenic strains incorporating se-quenced cosmids; rescue of lethal andvisible mutants by these strains will allowprecise correlation of the genetic mapwith the sequence (13-15).By far the largest use of the genome

map and sequence is for the study ofspecific C. elegans genes. Virtually all the150 C. elegans laboratories make use of themap, and most of them request clones

from it. Increasingly, laboratories workingon other organisms are also using theseresources.

Applications of the sequence informa-tion are on a variety of levels. At the mostmundane level, the prior determination ofthe sequence by an efficient large-scaleoperation simply saves subsequent effortand resources. Importantly, the sequenceprovides ready-made tools, such as a re-striction map and information for makingprimers, that facilitate experimental de-sign.

More creatively, the sequence providesnew entry points to the genome. Ho-mologs of known genes or parts of genescan be sought by computer. Not only is thisstyle of searching faster than physicalprobing but also it is more thorough.Weak similarities, beyond the detectionlimit of hybridization, can be picked upand evaluated. At present, investigatorscan search a total of about 38 Mb (22.5 Mbfinished, 15.5 Mb unfinished), containingabout half of the genes. Searches willbecome steadily more complete as the

0MW in: 1fl~~ diA:ve zone: 1 39734 1

loumns I IZoom In : rZoom ut Eear I I o .... jI 1GeneEmn

LINK 19Text: 940511: Sent back to finisher may be an extra T at 33235FIOF2

Text: 940621: Extra T was confirmedtoto: (NULL KEY)Clone: F1OF2

Text: Prosite: match PS00017 ATPIGTP-binding site motif A (P-loop)Text: Similarity with lipase (Bacillus subtilis) (blastp score 84)F1OF2.3

Text: hpase (Bacillus subtilis)

15000

-Fl1OF2.4 Text: 2 leucine rich domainsU

m l=

ms - -

- F10F2.7Text: F1OF2.7 is 54.1% identical with FIOF2.8Text: F10F2.7 is 50.5% identical with FIOF2.6

Text: C-lectn binding domainSequence: F10F2.6

Text: F10F2.7 is 56.2% identical with F1OF2.5Sequence: F1OF2.5Sequence: F10F2.8Sequence: F10F2.7

Text: F1OF2.6 is 50.5% identcal with F1OF2.7F1OF2.6 Text: C-lectin binding domain

Sequence: F1OF2.5Text: FlOF2.6 is 74.4% identical with Fl 0F2.8Text: FIOF2.6 is 53.6% identical with F1OF2.5F10F2.8 Sequence: F1OF2.8Text: F10F2.8 is 74.4% identical with FIOF2.6

Sequence: F10F2.7Text: F1OF2.8 is 59.4% identical with FIOF2.5

-Text: Text: C-lectin bindinQ domainText: F1OF2.8 is 54.1% identcal with FIOF2.7

Sequence: FIOF2.6Sequence: F1OF2.5

_- Text: F1OF5.5 is 56.2% identical with F1OF2.7Text: F1OF5.5 is 53.6% identical with FIOF2.6

- es F1OF2.5 Sequence: F1OF2.7Sequence: FIOF2.6

= Text: Text: C-lectn binding domainText: F10F2.5 is 59.4% identical with F1OF2.8

_ - Sequence: F1OF2.8- *F1OF2.9

T23Fl 1 toto: (NULL KEY)Clone: T23F1 1

FIG. 1. To the left of the kb scale-i.e., on the negative strand-a gene with two large introns is shown. To the right of the scale (positive strand)a family of five genes is seen to lie within the introns. Figs. 1 and 2 are taken from the feature window of ACeDB.

10000

200001

250001

30000

35000

i

40000

Proc. Natl. Acad. Sci. USA 92 (1995)

II

I

Dow

nloa

ded

by g

uest

on

June

30,

202

0

Proc. Natl. Acad. Sci. USA 92 (1995) 10839

project proceeds and will be greatly en-hanced as the emerging families of genesand domains are subjected to cluster anal-ysis.As the sequenced regions extend along

the chromosomes, the large-scale struc-ture of the genome starts to emerge. Weare only just beginning to explore thislevel, but some of the early findings areillustrated in Figs. 1 and 2. Additionally,direct analysis of the sequence yields manyinteresting features, with applications to

gene function, evolution, and medicine. Afew selected examples follow.

(i) Genes are found within the introns ofother genes. In one case, five such geneswere found within a single gene (Fig. 1).

(ii) There are clusters of tRNA genes,containing five members in one case andsix in another.

(iii) Different types of gene families arefound: some where the family membersare dispersed and others where they areclose together in tandem arrays. We can

begin to look at the evolution of theindividual members.

(iv) A relatively high incidence of in-verted repeats is found within the introns ofpredicted genes. The functional significanceof this, if any, is unknown.

(v) Genes have been found in head-to-tail patterns, which have been shown to beindicative of operons (Fig. 2) (16).

(vi) Homologs to genes of medical im-portance have been found-e.g., the onlyknown invertebrate Lowe syndrome ho-molog (17).

_R~ [§El I dcivYE zone: 1 40699

Euumn=s Zoom In Zoom l lear ev oim.. ( one

In o 0a1

OS0_G O

fin

a _

ea o1!3 Qua _

m

SUPER LINK LINK 3 LINK 1ZK637.8aZK637-8ab e^ns<-;fi§43quen}wvRX>>

Sequence: yk39g5.5Sequence: ykl Sal 1.3Sequence: ykl ala2.3System: (NULL KEY)Sequence: yklSall.5Sequence: cmO8g12Sequence: yk42b11.3Text: TJ6/roton pumpClone: pUL#AL10Sequence: yk39g5.3Sequence: Yk5id7.3Sequence: WP:CE00438Sequence: yk2Od9.5Sequence: yk5d7.5

ZK637.9 TSLsite Sequence: WP:CE00439

FIG. 2. The three genes ZK637.8, ZK637.9, and ZK637.10 lie head to tail on the positive strand. They have been shown to be transcribed asone operon (16). Comparison with cDNA sequences (yellow boxes) shows that ZK637.8 has two alternative splicing patterns at position 23,000.

22000[

23000

24000

25000

26000

27000

28000

31000

32000

lin-9

ZK637.10 TSL_sfte Sequence: WP:CE00428jI . C Text: Glutahone reductasem _

ml1 8

I a

ZK637.11 Text: CDC25IstirgSequence: WP:CE00429

_

,

Review: Waterston and Sulston

Dow

nloa

ded

by g

uest

on

June

30,

202

0

10840 Review: Waterston and Sulston

(vii) The largest known C. elegans pro-tein has been predicted from a 45-kb genein cosmid K07E12. It contains 13,000amino acid residues and contains multiplecopies of the cell adhesion molecule motif(18, 19).

(viii) Many repeat families have beenidentified. Some are thought to be "dead"transposons from an unknown and possi-bly extinct transposon type.

Apart from patterns of genes, we beginto see the matrix of duplicated, inverted,and transposed pieces of which the ge-nome is composed. Somewhere there areelements that mediate replication, recom-bination, and segregation of the chromo-somes and others that control sex deter-mination, dosage compensation, andglobal gene expression. The sequence willprovide the framework on which hypoth-eses can be developed and tested.

Finally, the sequence forms a perma-nent archive whose value we can onlybegin to tap at the first pass. The analysis,modification, and above all comparison ofsequences from different organisms willprovide a major route to a full understand-ing of biology.

This work was supported by the NationalInstitutes of Health National Center for HumanGenome Research, the U.K. Medical ResearchCouncil, and the Wellcome Trust.

1. Coulson, A. R., Sulston, J. E., Brenner, S.& Karn, J. (1986) Proc. Natl. Acad. Sci.USA 83, 7821-7825.

2. Coulson, A. R., Waterston, R. H., Kiff,J. E., Sulston, J. E. & Kohara, Y. (1988)Nature (London) 335, 184-186.

3. Coulson, A., Kozono, Y., Lutterbach, B.,Shownkeen, R., Sulston, J. E. & Water-ston, R. H. (1991) BioEssays 13, 413-417.

4. Durbin, R. & Thierry-Mieg, J. (1994)Computational Methods in Genome Re-search (Plenum, New York).

5. Edgley, M. L. & Riddle, D. L. (1990)Genetic Maps 5, 3.

6. Waterston, R., Martin, C., Craxton, M.,Huynh, C., Coulson, A., Hillier, L.,Durbin, R., Green, P., Shownkeen, R.,Halloran, N., Metzstein, M., Hawkins, T.,Wilson, R., Berks, M., Du, Z., Thomas, K.,Thierry-Mieg, J. & Sulston, J. (1992) Nat.Genet. 1, 114-123.

7. Vaudin, M., Roopra, A., Hillier, L., Brink-man, R., Sulston, J., Wilson, R. K. &Waterston, R. H. (1995) Nucleic AcidsRes. 23, 670-674.

8. Wilson, R., Ainscough, R., Anderson, K.,Baynes, C., Berks, M., et al. (1994) Nature(London) 368, 32-38.

9. Altschul, S. F., Gish, W., Miller, W., My-ers, E. W. & Lipman, D. J. (1990) J. Mol.Biol. 215, 403-410.

10. Hope, I. A. (1994) Development (Cam-bridge, U.K) 120, 505-514.

11. Lynch, A. S., Briggs, D. & Hope, I. A.(1995) Nat. Genet., in press.

12. Zwaal, R. R., Broeks, A., van Meurs, J. &Plasterk, R. H. A. (1993) Proc. Natl. Acad.Sci. USA 90, 7431-7435.

13. Howell, A. M. & Rose, A. M. (1990)Genetics 126, 583-592.

14. Rose, A. M., Edgley, M. L. & Baillie,D. L. (1994) Adv. Mol. Plant Nematol.,19-33.

15. Schein, J. E., Marra, M. A., Benian,G. M., Fields, C. A. & Baillie, D. L. (1993)Genome 36, 1148-1156.

16. Zorio, D. A. R., Cheng, N. S. N., Blumen-thal, T. E. & Spieth, J. (1994) Nature(London) 372, 270-272.

17. Leahey, A. M., Charnas, L. R. & Nuss-baum, R. L. (1993) Hum. Mol. Genet. 2,461-463.

18. Rathjen, F. G. & Jessell, T. M. (1991)Semin. Neurosci. 3, 297-307.

19. Streuli, M., Krueger, N. X., Tsai, A. Y. M.& Saito, H. (1989) Proc. Natl. Acad. Sci.USA 86, 8698-8702.

20. Albertson, D. G. (1985) EMBO J. 4, 2493-2498.

21. Ioannou, P. A., Amemiya, C. T., Garnes,J., Kroisel, P. M., Shizuya, H., Chen, C.,Batzer, M. A. & de Jong, P. J. (1994) Nat.Genet. 6, 84-89.

22. Shizuya, H., Birren, B., Kim, U.-J.,Mancino, V., Slepak, T., Tachiiri, Y. &Simon, M. (1992) Proc. Acad. Natl. Sci.USA 89, 8794-8797.

23. Staden, R. (1994) Methods Mol. Biol. 25,37-67.

Proc. Natl. Acad. Sci. USA 92 (1995)

Dow

nloa

ded

by g

uest

on

June

30,

202

0

Documents

Review genome of Caenorhabditis elegans · theentire C. elegans researchcommunity. The communal approach is important in two ways. Fromthe point ofview of the map,it ensuresthatthespecializedknowl-edge