28
Principles of Genome Analysis and Genomics Sandy B. Primrose Business and Technology Management High Wycombe Buckinghamshire, UK Richard M. Twyman Department of Biology University of York York, UK THIRD EDITION ··

POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Principles of Genome Analysis and Genomics

Sandy B. PrimroseBusiness and Technology ManagementHigh WycombeBuckinghamshire, UK

Richard M. TwymanDepartment of BiologyUniversity of YorkYork, UK

THIRD EDITION

··

POGA01 7/12/06 11:39 AM Page i

Page 2: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

© 1995, 1998, 2003 by Blackwell Sciencea Blackwell Publishing company

350 Main Street, Malden, MA 02148-5018, USA108 Cowley Road, Oxford OX4 1JF, UK

550 Swanston Street, Carlton South, Melbourne, Victoria 3053, AustraliaKurfürstendamm 57, 10707 Berlin, Germany

The rights of Sandy Primrose and Richard Twyman to be identified as the Authors of this Workhave been asserted in accordance with the UK Copyright, Designs, and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical,

photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs, and Patents Act 1988, without the prior permission of the publisher.

First published 1995Second edition 1998Third edition 2003

Library of Congress Cataloging-in-Publication Data

Primrose, S.B.Principles of genome analysis and genomics/

Sandy B. Primrose, Richard M. Twyman.—3rd ed.p. cm.

Includes bibliographical references and index.ISBN 1-40510-120-2 (pbk.: alk. paper)

1. Gene mapping. 2. Nucleotide sequence. 3. Genomics.I. Twyman, Richard M. II. Title.

QH445.2.P75 2002572.8′633—dc21 2002070937

A catalogue record for this title is available from the British Library.

Set in 9A/12pt Photinaby Graphicraft Limited, Hong Kong

Printed and bound in Italyby G. Canale & C.S.P.A., Turin

For further information onBlackwell Publishing, visit our website:http://www.blackwellpublishing.com

··

POGA01 7/12/06 11:39 AM Page ii

Page 3: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Retrofitting, 44Choice of vector, 45Suggested reading, 45Useful website, 46

4 Assembling a physical map of thegenome, 47Introduction, 47Restriction enzyme fingerprinting, 47Marker sequences, 49Hybridization assays, 59Physical mapping without cloning, 64Integration of different mapping methods, 69Suggested reading, 70Useful websites, 70

5 Sequencing methods and strategies, 71Basic DNA sequencing, 71Modifications of chain-terminator sequencing, 73Automated DNA sequencing, 75DNA sequencing by capillary arrayelectrophoresis, 76Basecalling and sequence accuracy, 77High-throughput sequencing, 77Sequencing strategies, 78Alternative DNA sequencing methodologies, 85Suggested reading, 92Useful websites, 93

6 Genome annotation and bioinformatics, 94Introduction, 94Traditional routes to gene identification, 94Databases, 97Overview of sequence analysis, 97Detecting open-reading frames, 99Software programs for finding genes, 99Using homology to find genes, 102

··

Contents

Preface, vAbbreviations, vi

1 Setting the scene: the new science ofgenomics, 1Introduction, 1Physical mapping of genomes, 1Sequencing whole genomes, 3Benefits of genome sequencing, 5Outline of the rest of the book, 6Terminology, 8Keeping up to date, 8Suggested reading, 8Useful websites, 9

2 The organization and structure ofgenomes, 10Introduction, 10Genome size, 10Sequence complexity, 11Introns and exons, 13Genome structure in viruses and prokaryotes, 16The organization of organelle genomes, 18The organization of nuclear DNA ineukaryotes, 19Suggested reading, 33Useful websites, 33

3 Subdividing the genome, 34Introduction, 34Fragmentation of DNA with restrictionenzymes, 34Separating large fragments of DNA, 36Isolation of chromosomes, 37Chromosome microdissection, 39Vectors for cloning DNA, 39Yeast artificial chromosomes, 40P1-derived and bacterial artificialchromosomes as alternatives to yeast artificialchromosomes, 42

POGA01 7/12/06 11:40 AM Page iii

Page 4: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

iv Contents

Analysis of non-coding RNA and extragenic DNA, 105Identifying the function of a new gene, 105Secondary databases of functional domains, 107Gene ontology, 108Analyses not based on homology, 108Genome annotation, 111Molecular phylogenetics, 111Suggested reading, 112Useful websites, 112

7 Comparative genomics, 113Introduction, 113Orthologues, paralogues and genedisplacement, 113Protein evolution by exon shuffling, 113Comparative genomics of prokaryotes, 114Comparative genomics of organelles, 121Comparative genomics of eukaryotes, 123Other aspects of comparative genomics, 130Suggested reading, 132Useful websites, 132

8 Protein structural genomics, 133Introduction, 133Determining gene function by sequencecomparison, 133Determining gene function through conservedprotein structure, 134Approaches to protein structural genomics, 136Suggested reading, 144Useful website, 145

9 Global expression profiling, 146Introduction, 146Traditional approaches to expression profiling, 146Global analysis of RNA expression, 148

Global analysis of protein expression, 163Suggested reading, 171Useful websites, 172

10 Comprehensive mutant libraries, 173Introduction, 173High-throughput systematic gene knockout, 173Genome-wide random mutagenesis, 175Suggested reading, 186Useful websites, 186

11 Mapping protein interactions, 187Introduction, 187Methods for protein interaction analysis, 187Library-based methods for screening proteininteractions, 189Informatics tools and resources for proteininteraction data, 196Suggested reading, 199Useful websites, 199

12 Applications of genome analysis andgenomics, 200Introduction, 200Theme 1: Understanding genetic diseases ofhumans, 200Theme 2: Understanding responses to drugs(pharmacogenomics), 209Theme 3: Getting to grips with bacterialpathogenicity, 211Theme 4: Getting to grips with quantitativetraits, 218Theme 5: The impact of genomics onagriculture, 223Theme 6: Developmental genomics, 225Suggested reading, 228

References, 230Index, 255

··

POGA01 7/12/06 11:40 AM Page iv

Page 5: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

For most of the 20th century, a central problem ingenetics was the creation of maps of entire chromo-somes. These maps were crucial for the understand-ing of the structure of genes, their function and theirevolution. For a long time these maps were createdby genetic means, i.e. as a result of sexual crosses.Starting about 15 years ago, recombinant DNAtechnology was used to generate molecular or phys-ical maps, defined here as the ordering of distin-guishable DNA fragments by their position along thechromosome. Two key features of physical maps arethat they can be generated much more quickly thangenetic maps, and they usually have a much denserarray of markers. The existence of physical mapsnow is greatly facilitating the analysis of a number ofkey questions in genetics such as the molecular basisof polygenic disorders and quantitative traits.

Physical mapping embraces a wide range ofmanipulative and analytical techniques which aredetailed in specialist journals using a specialist lan-guage. This means that it is difficult for even theexperienced biologist entering the field to compre-hend the latest developments or even what has beenachieved. The first edition of this book was written toprovide these new entrants with an overview of themethodologies employed.

The ultimate physical map is a complete genomesequence. Shortly after the first edition of this bookwas published in 1995, the complete sequence of abacterial genome was reported for the first time.From a technical point of view this was a particu-larly noteworthy achievement because the genomesequenced had a size of 1.8 million base pairs, yet the longest individual piece of DNA that can besequenced is only 600–800 nucleotides. Soon there-after, the sequences of several mammalian chromo-

somes were reported as well as the entire genome of the yeast Saccharomyces cerevisiae. The key issuewas no longer how to sequence a genome but how tohandle the sequence data. These changes werereflected in the second edition published in 1998.

As we entered the 21st century, the list ofsequenced genomes included over 60 bacteria plus those of yeast, the nematode, the fruit fly, aflowering plant and humans. The size of the humangenome (3 billion base pairs) indicates the progressmade since the first edition was published. Sequen-cing of whole genomes is progressing at a rapid ratebut the emphasis is now shifting back to biologicalquestions. For example, how do the different com-ponents of the genome and the different gene prod-ucts interact? Answers to questions such as this arebeing provided using yet another new set of tools in combination with the established mapping andsequencing methodologies. This branch of biology isknown as genomics and its importance is such thathalf of this third edition is devoted to it.

We would like to thank our friends and colleagueswho supplied material for this book and took thetime to read and make comments on individualchapters, in particular Drs Phillip Gardner, DylanSweetman, Ajay Kohli, Gavin Craig, Eva Stoger and Professor Christine Godson. Thanks also to SueGoddard and her staff in the CAMR library whocheerfully found references and papers at shortnotice. Richard Twyman would like to thankKathryn Parry and family (Cain, Evan, Tate andImogen) for their kindness and hospitality duringthe summer of 2001 when this project began. Hewould like to dedicate this book to his parents, Peterand Irene, his children, Emily and Lucy, and toHannah, Joshua and Dylan.

··

Preface

POGA01 7/12/06 11:41 AM Page v

Page 6: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

2DE two-dimensional gel electrophoresisAc ActivatorADME adsorption, distribution, metabolism

and excretionAFBAC affected family-based controlAFLP amplified fragment length

polymorphismALL acute lymphoblastic leukaemiaAML acute myeloid leukaemiaAPL acute promyelocytic leukaemiaARS autonomously replicating sequenceATRA all-trans-retinoic acidBAC bacterial artificial chromosomeBCG Bacille Calmette–GuérinbFGF basic fibroblast growth factorBIND Biomolecular Interaction Network

DatabaseBLAST Basic Local Alignment Search ToolBLOSUM Blocks Substitution MatrixBMP bone morphogenetic proteinbp base pairBRET bioluminescence resonance energy

transferCAPS cleaved amplified polymorphic

sequencesCASP Critical Assessment of Structural

PredictionCATH Class, Architecture, Topology and

Homologous superfamily (database)CCD charge couple deviceCD circular dichroismcDNA complementary DNACEPH Centre d’Etude du Polymorphisme

HumainCHEF contour-clamped homogeneous

electrical fieldCID collision-induced dissociationcM centimorganCOG cluster of orthologous groupscR centiRaycRNA complementary RNA

CSSL chromosome segment substitutionline

ct chloroplastDDBJ DNA Databank of JapanDIP Database of Interacting ProteinsDMD Duchenne muscular dystrophyDNA deoxyribonucleic aciddNTP deoxynucleoside triphosphateDs DissociationdsDNA double-stranded DNAdsRNA double-stranded RNAEGF epidermal growth factorELISA enzyme-linked immunosorbent

sandwich assayEMBL European Molecular Biology

LaboratoryENU ethylnitrosoureaES embryonic stem (cells)ESI electrospray ionizationEST expressed sequence tagEUROFAN European Functional Analysis

Network (consortium)FACS fluorescence-activated cell sortingFEN flap endonucleaseFIAU Fialuridine (1-2′-deoxy-2′-fluoro-β-

d-arabinofuranosyl-5-iodouracil)FIGE field-inversion gel electrophoresisFISH fluorescence in situ hybridizationFPC fingerprinted contigsFRET fluorescence resonance energy

transferFSSP Fold classification based on

Structure–Structure alignment ofProteins (database)

GASP Genome Annotation AssessmentProject

G-CSF granulocyte colony stimulatingfactor

GeneEMAC gene external marker-basedautomatic congruencing

GGTC German Gene Trap Consortium

··

Abbreviations

POGA01 7/12/06 11:41 AM Page vi

Page 7: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Abbreviations vii

GST gene trap sequence tagGST glutathione-S-transferaseHAT hypoxanthine, aminopterin and

thymidineHDL high-density lipoproteinHERV human endogenous retrovirusHPRT hypoxanthine phosphoribosyl-

transferaseHTF HpaII tiny fragmenthtSNP haplotype tag single nucleotide

polymorphismIDA interaction defective alleleIhh Indian hedgehogIPTG isopropylthio-β-d-galactopyranosideIST interaction sequence tagIVET in vivo expression technologykb kilobaseLCR low complexity regionLD linkage disequilibriumLINE long interspersed nuclear elementLOD logarithm10 of oddsLTR long terminal repeatm : z mass : charge ratioMAD multiwavelength anomalous

diffractionMAGE microarray and gene expressionMAGE-ML microarray and gene expression

mark-up languageMAGE-OM microarray and gene expression

object modelMALDI matrix assisted laser desorption

ionizationMb megabaseMGED Microarray Gene Expression

DatabaseMIAME minimum information about a

microarray experimentMIP molecularly imprinted polymerMIPS Munich Information Center for

Protein SequencesMM ‘mismatch’ oligonucleotideMPSS massively parallel signature

sequencingmRNA messenger RNAMS mass spectrometryMS/MS tandem mass spectroscopymt mitochondrialMTM Maize Targeted Mutagenesis projectMu Mutator

MudPIT multidimensional proteinidentification technology

NGF nerve growth factorNIL near isogenic lineNMR nuclear magnetic resonanceNOE nuclear Overhauser effectNOESY NOE spectroscopynt nucleotideOFAGE orthogonal-field-alternation gel

electrophoresisORF open-reading frameORFan orphan open-reading frameP/A presence/absence polymorphismPAC P1-derived artificial chromosomePAGE polyacrylaminde gel electrophoresisPAI pathogenicity islandPAM percentage of accepted point

mutationsPCR polymerase chain reactionPDB Protein Databank (database)Pfam Protein families database of

alignmentsPFGE pulsed field gel electrophoresisPM ‘perfect match’ oligonucleotidepoly(A)+ polyadenylatedPQL protein quantity lociPRINS primed in situPS position shift polymorphismPSI-BLAST Position-Specific Iterated BLAST

(software)PVDF polyvinylidine difluorideQTL quantitative trait lociRACE rapid amplification of cDNA endsRAPD randomly amplified polymorphic

DNARARE RecA-assisted restriction

endonucleaseRC recombinant congenic (strains)RCA rolling circle amplificationrDNA/RNA ribosomal DNA/RNARFLP restriction fragment length

polymorphismRIL recombinant inbred lineR-M restriction-modificationRNA ribonucleic acidRNAi RNA interferenceRNase ribonucleaseRPMLC reverse phase microcapillary liquid

chromatography

··

POGA01 7/12/06 11:41 AM Page vii

Page 8: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

viii Abbreviations

RT-PCR reverse transcriptase polymerasechain reaction

RTX repeats in toxinsSAGE serial analysis of gene expressionSCOP Structural Classification of Proteins

(database)SDS sodium dodecylsulphateSELDI surface-enhanced laser desorption

and ionizationSGDP Saccharomyces Gene Deletion ProjectShh sonic hedgehogSINE short interspersed nuclear elementSINS sequenced insertion sitesSNP single nucleotide polymorphismSPIN Surface Properties of protein–protein

Interfaces (database)Spm Suppressor–mutatorSPR surface plasmon resonanceSRCD synchrotron radiation circular

dichroismSSLP simple sequence length

polymorphism

SSR simple sequence repeatSTC sequence-tagged connectorSTM signature-tagged mutagenesisSTS sequence-tagged siteTAC transformation-competent artificial

chromosomeTAFE transversely alternating-field

electrophoresisTAR transformation-associated

recombinationT-DNA Agrobacterium transfer DNATIGR The Institute for Genomic ResearchTIM triose phosphate isomeraseTOF time of flighttRNA transfer RNATUSC Trait Utility System for CornUPA universal protein arrayUTR untranslated regionVDA variant detector arrayVIGS virus-induced gene silencingY2H yeast two-hybridYAC yeast artificial chromosome

··

POGA01 7/12/06 11:41 AM Page viii

Page 9: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

··

CHAPTER 1

Setting the scene: the new science of genomics

Introduction

Genetics is the study of the inheritance of traits fromone generation to another. As such, it examines thephenotypes of the offspring of sexual crosses. Usefulas these data may be, they cannot provide an explana-tion for the biological basis of a phenotype for thatrequires biochemical information. In some cases thejump from phenotype to biochemical explanationwas relatively simple. Good examples are amino acidauxotrophy and antibiotic resistance in microor-ganisms and phenylketonuria and sickle cell diseasein humans. However, until recently, it was almostimpossible to determine the biochemical basis formost of the traits in most organisms.

The first major advance in understanding pheno-types came in the mid-1970s with the developmentof methods for manipulating genes in vitro (‘geneticengineering’). This permitted genes to be cloned and sequenced which in turn provided data on theamino acid sequence of the gene product. It alsobecame possible to overexpress the gene product,thereby facilitating its purification and characteriza-tion. As the number of characterized gene productshas grown, the determination of gene function hasbecome easier, as it is possible to search databases for closely related proteins whose properties areknown. Other techniques that have facilitated theanalysis of phenotypes are site-directed mutagen-esis, where specific base changes or deletions can bemade in genes, and gene replacement. All of thesetechniques, and many others, are described in ourcompanion volume Principles of Gene Manipulation(Primrose et al. 2001).

Over the past 25 years a vast amount of data has been generated for thousands of different geneproducts from many different organisms, most of itas a direct result of the ability to manipulate genes.Impressive as this is, gene manipulation on its owncannot meet all the needs of biologists. First, in manyinstances, a gene needs to be mapped close to a

convenient marker before it can be cloned. Whilethis may be easy in an organism such as Drosophilawhere many mutants are available, it is much moredifficult in humans or in organisms whose geneticshave been poorly studied. Secondly, understandingthe phenotype of one or a few genes gives little infor-mation about the whole organism and how all itscomponents interact, e.g. its metabolic capabilitiesor how it controls its development. Thirdly, the an-alysis of a few genes does not enable us to answer the big questions in biology. For example, how didspeech and memory evolve, what changes at theDNA level occurred as the primates evolved, etc.?However, these needs now are being met as a resultof efforts to sequence the entire genomes of a num-ber of organisms.

Physical mapping of genomes

In the mid-1980s, scientists began to discuss seri-ously how the entire human genome might besequenced. To put these discussions in context, thelargest stretch of DNA that can be sequenced in asingle pass is 600–800 nucleotides and the largestgenome that had been sequenced was the 172 kbEpstein–Barr virus DNA (Baer et al. 1984). By com-parison, the human genome has a size of 3000 Mb.One school of thought was that completely newsequencing methodology would be required and anumber of different technologies were explored butwith little success. Early on, it was realized that inorder to sequence a large genome it would be neces-sary to break the genome down into more manage-able pieces for sequencing and then join the piecestogether again. The problem here was that therewere not enough markers on the human genome. Itshould be noted that humans represent an extremecase of difficulty in creating a genetic map. Not onlyare directed matings not possible, but the length of thebreeding cycle (15–20 years) makes conventional

POGC01 7/12/06 11:55 AM Page 1

Page 10: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

2 Chapter 1

analysis impossible. A major breakthrough was the development of methods for using DNA probes to identify polymorphic sequences (Botstein et al.1980). The first such DNA polymorphisms to bedetected were differences in the length of DNA frag-ments after digestion with sequence-specific restric-tion endonucleases, i.e. restriction fragment lengthpolymorphisms (RFLPs; Fig. 1.1).

To generate an RFLP map the probes must behighly informative. This means that the locus must not only be polymorphic, it must be very poly-morphic. If enough individuals are studied, any randomly selected probe will eventually discover apolymorphism. However, a polymorphism in whichone allele exists in 99.9% of the population and the other in 0.1% is of little utility because it seldomwill be informative. Thus, as a general rule, theRFLPs used to construct the genetic map shouldhave two, or perhaps three, alleles with equivalentfrequencies.

The first RFLP map of an entire genome (Fig. 1.2)was that described for the human genome by Donis-

Keller et al. (1987). They tested 1680 clones from a phage library of human genomic DNA to seewhether they detected RFLPs by hybridization toSouthern blots of DNA from five unrelated indi-viduals. DNA from each individual was digestedwith 6–9 restriction enzymes. Over 500 probes wereidentified that detected variable banding patternsindicative of polymorphism. From this collection, asubset of 180 probes detecting the highest degree ofpolymorphism was selected for inheritance studiesin 21 three-generation human families (Fig. 1.3).Additional probes were generated from chromo-some-specific libraries such that ultimately 393RFLPs were selected. The various loci were arrangedinto linkage groups representing the 23 humanchromosomes by a combination of mathematicallinkage analysis and physical location of selectedclones. The latter was achieved by hybridizingprobes to panels of rodent–human hybrid cells con-taining varying human chromosomal complements(see p. 38). RFLP maps have not been restricted tothe human genome. For example, RFLP maps have

··

Parents

Offspring

(b)

Heterozygousparents

Affectedoffspring

Heterozygousoffspring

Normaloffspring

Heterozygousparents

Directionof DNAmigration

(a)

Restriction sites

Probe

Normal

Affected

Fig. 1.1 Example of a RFLP and its usefor gene mapping. (a) A polymorphicrestriction site is present in the DNAclose to the gene of interest. In theexample shown, the polymorphic site is present in normal individuals butabsent in affected individuals. (b) Use ofthe probe shown in Southern blottingexperiments with DNA from parentsand progeny for the detection of affectedoffspring.

POGC01 7/12/06 11:55 AM Page 2

Page 11: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

The new science of genomics 3

been published for most of the major crops (see forexample Moore et al. 1995).

The human genome map produced by Donis-Keller et al. (1987) was a landmark publication.However, it identified RFLP loci with an averagespacing of 10 centimorgans (cM). That is, the locihad a 10% chance of recombining at meiosis. Giventhat the human genome is 4000 cM in length, thedistance between the RFLPs is 10 Mb on average.This is too great to be of use for gene isolation. How-ever, if the methodology of Donis-Keller et al. (1987)was used to construct a 1 cM map, then 100 timesthe effort would be required! This is because 10 timesas many probes would be required and 10 timesmore families studied. The solution has been to usemore informative polymorphic markers and othermapping techniques and these are described indetail in Chapter 4. Use of these techniques has led to the generation of a human map with the desireddensity of markers. More important, these advances

in gene mapping were not restricted to the humangenome. Rather, the methodology is generic andnow has been applied to a wide range of animal andplant genomes.

Sequencing whole genomes

The late 1980s and early 1990s saw much debateabout the desirability of sequencing the humangenome. This debate often strayed from rationalescientific debate into the realms of politics, personal-ities and egos. Among the genuine issues raised werequestions such as: Is the sequencing of the humangenome an intellectually appropriate project forbiologists?; Is sequencing the human genome feas-ible?; What benefits might arise from the project?;Will these benefits justify the cost and are there alter-native ways of achieving the same benefits?; Will theproject compete with other areas of biology for fund-ing and intellectual resources? Behind the debatewas a fear that sequencing the human genome wasan end in itself, much like a mountaineer who climbsa new peak just because it is there.

In early 2001 two different groups (InternationalHuman Genome Sequencing Consortium 2001;Venter et al. 2001) reported the draft sequence of the

··

1 22 2

1 2

2 32 2

2 22 3 2 2 2 2 2 3 1 2 2 3 1 2

2 3

–1

–2

–3

Fig. 1.3 Inheritance of a RFLP in three generations of afamily. The RFLP probe used detects a single locus on humanchromosome 5. In the family shown, three alleles are detectedon Southern blotting after digestion with TaqI. For each of theparents it can be inferred which allele was inherited from thegrandmother and which from the grandfather. For each childthe grandparental origin of the two alleles can then beinferred. (Redrawn from Donis-Keller et al. 1987, withpermission from Elsevier Science.)

Fig. 1.2 The first RFLP genetic linkage map of the entirehuman genome. (Reproduced from Donis-Keller et al. 1987,with permission from Elsevier Science.)

POGC01 7/12/06 11:55 AM Page 3

Page 12: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

4 Chapter 1

human genome. An analysis of this achievementprovides clear answers to the questions raised above.Those opposed to the idea of sequencing the humangenome had cited the resources (thousands of sci-entists and billions of dollars) and time that would be required to accomplish the task. Furthermore,they believed that once the human genome wassequenced there would be a major logistical problemin handling the sequence data. What happened was that the scientific community developed newstrategies for sequencing genomes, rather than newmethods for sequencing DNA, and complementedthese with the development of highly automatedmethodologies. The net effect was that by the timethe human genome had been sequenced, the com-plete sequence was already known for over 30 bac-terial genomes plus that of a yeast (Saccharomycescerevisiae), the fruit fly (Drosophila melanogaster), anematode (Caenorhabditis elegans) and a plant (Ara-bidopsis thaliana) (Table 1.1). Furthermore, a wholenew science, bioinformatics, had been developed tohandle and analyse the vast amounts of infor-mation being generated by these sequencing pro-jects. Fortuitously, the global development of theInternet occurred at the same time and this enabledscientists around the world to have access to thebioinformatics tools developed in global centres ofexcellence.

The development of bioinformatics not only facil-itated the handling and analysis of sequence databut the development of sequencing strategies as

well. For example, when a European consortium setthemselves the goal of sequencing the entire genomeof the budding yeast Saccharomyces (15 Mb), theysegmented the task by allocating the sequencing ofeach chromosome to different groups. That is, theysubdivided the genome into more manageable parts.At the time this project was initiated there was noother way of achieving the objective and when theresulting genomic sequence was published (Goffeauet al. 1996), it was the result of a unique multicentrecollaboration. While the Saccharomyces sequencingproject was underway, a new genomic sequencingstrategy was unveiled: shotgun sequencing. In thisapproach, large numbers of genomic fragments aresequenced and sophisticated bioinformatics algo-rithms used to construct the finished sequence. Incontrast to the consortium approach used with Sac-charomyces, a single laboratory set up as a sequen-cing factory undertook shotgun sequencing.

The first success with shotgun sequencing wasthe complete sequence of the bacterium Haemophilusinfluenzae (Fleischmann et al. 1995) and this wasquickly followed with the sequences of Mycoplasmagenitalium (Fraser et al. 1995), Mycoplasma pneumo-niae (Himmelreich et al. 1996) and Methanococcusjannaschii (Bult et al. 1996). It should be noted thatH. influenzae was selected for sequencing because solittle was known about it: there was no genetic mapand not much biochemical data either. By contrast,S. cerevisiae was a well-mapped and well-character-ized organism. As will be seen in Chapter 5, the

··

Table 1.1 Increases in sizes of genomes sequenced.

Genome sequenced

Bacteriophage fX174Plasmid pBR322Bacteriophage lEpstein–Barr virusYeast chromosome IIIHaemophilus influenzaeSaccharomyces cerevisiaeCeanorhabditis elegansDrosophila melanogasterArabidopsis thalianaHomo sapiensRice (Oryza sativa)Pufferfish (Fugu rubripes)Mouse (Mus musculis)

Year

19771979198219841992199519961998200020002001200220022002/3

Genome size

5.38 kb4.3 kb

48.5 kb172 kb315 kb

1.8 Mb12 Mb97 Mb

165 Mb125 Mb

3000 Mb430 Mb400 Mb

2700 Mb

Comment

First genome sequencedFirst plasmid sequenced

First chromosome sequencedFirst genome of cellular organism to be sequencedFirst eukaryotic genome to be sequencedFirst genome of multicellular organism to be sequenced

First plant genome to be sequencedFirst mammalian genome to be sequencedFirst crop plant to be sequencedSmallest known vertebrate genomeClosest model organism to man

POGC01 7/12/06 11:55 AM Page 4

Page 13: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

The new science of genomics 5

relative merits of shotgun sequencing vs. ordered,map-based sequencing still are being debated today.Nevertheless, the fact that a major sequencing lab-oratory can turn out the entire sequence of a bac-terium in 1–2 months shows the power of shotgunsequencing.

Benefits of genome sequencing

Fears that sequencing the human genome would be an end in itself have proved groundless. Becauseso many different genomes have been sequenced itnow is possible to undertake comparative analyses,a topic known as comparative genomics. By compar-ing genomes from distantly related species we canbegin to decipher the major stages in evolution. Bycomparing more closely related species we can beginto uncover more recent events such as genome rearrangement and mutation processes. Currently,the most fertile area of comparative genomics is theanalysis of bacterial genomes because so many havebeen sequenced. Already this analysis is throwingup some interesting questions. For example, over25% of the genes in any one bacterial genome haveno analogues in any other sequenced genome. Isthis an artefact resulting from limited sequence dataor does it reflect the unique evolutionary events thathave shaped the genomes of these organisms? Again,comparative analysis of the genomes of a wide rangeof thermophiles has revealed numerous interestingfeatures, including strong evidence of extensive hori-zontal gene transfer. However, what is the genomicbasis for thermophily? We still do not know. Com-parative genomics also is of value with higherorganisms. Selective breeding is not acceptable withhumans so we use other mammals as surrogates.Which is the best species to select as a surrogate?Comparative genomics can answer this question.

One of the fascinating aspects of the classic paperof Fleischmann et al. (1995) was their analysis of themetabolic capabilities of H. influenzae which theydeduced from sequence information alone. Thisanalysis (metabolomics) has been extended to everyother sequenced genome and is providing tremend-ous insight into the physiology and ecological adapt-ibility of different organisms. For example, obligateparasitism in bacteria is linked to the absence ofgenes for certain enzymes involved in central meta-bolic pathways. Another example is the correlation

between genome size and the diversity of ecologicalniches that can be colonized. The larger the bacterialgenome, the greater are the metabolic capabilities ofthe host organism and this means that the organismcan be found in a greater number of habitats.

Analysis of genomes has enabled us to identifymost of the genes that are present. However, we stilldo not know what functions many of these genesperform and how important these genes are to thelife of the cell. Nor do we know how the differentgene products interact with each other. Becausemost gene products are proteins, the study of theseinteractions is known as proteomics. It is a disciplinewhich is rooted in experiments rather than com-puter analysis. For example, the principal way ofdetermining the function of a gene is to delete it andthen monitor the fitness of the deletion strain undera variety of selective conditions. This is an enormoustask but strategies have been developed for sim-plifying it. Of course, such a methodology cannot beused with humans, hence the need for comparativegenomics.

The biochemistry and genetics of many humandiseases have been elucidated but most of these aresimple single-gene disorders. Unfortunately, most ofthe common diseases of humans, and those that putmost financial burden on health provision, are poly-genic disorders, e.g. hypertension and cancer. Cur-rently we know very little about the causes of thesediseases and hence we are reduced to treating thesymptoms. One consequence of this is that manydrugs have undesirable side-effects or else only workwith certain individuals. Advances in genomics areenabling us to get a better handle on the causes ofdisease and this will lead to new therapies. Advancesin physical mapping of the human genome areenabling us to better predict the effectiveness of adrug and the likely side-effects. Another advance inhuman genetics is the beginning of an understand-ing of complex social traits. For example, the firstgene controlling speech has just been identified (Laiet al. 2001). Polygenic traits are of equal importancein agronomy. Many different characteristics such asweight gain, size, yield, etc. are controlled by manydifferent loci. Again, physical mapping is enablingus to identify the different loci involved and this willfacilitate more rational breeding programmes.

Another benefit of genome mapping and sequenc-ing that deserves mention is international scientificcollaboration. In magnitude, the goal of sequencing

··

POGC01 7/12/06 11:55 AM Page 5

Page 14: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

6 Chapter 1

the human genome was equivalent to putting a manon the moon. However, putting a man on the moonwas a race between two nations and was driven byglobal political ambitions as much as by scientificchallenge. By contrast, genome sequencing trulyhas been an international effort requiring labora-tories in Europe, North America and Japan to col-laborate in a way never seen before. The extent of this collaboration can be seen from an analysis of the affiliations of the authors of the papers on the sequencing of the genomes of Arabidopsis (TheArabidopsis Genome Initiative 2000) and humans(International Human Genome Sequencing Con-sortium 2001), for example. The fact that one UScompany, Celera, has successfully undertaken manysequencing projects in no way diminishes this col-laborative effort. Rather, they have constantly chal-lenged the accepted way of doing things and haveincreased the efficiency with which key tasks havebeen undertaken.

Three other aspects of genome sequencing andgenomics deserve mention. First, in other branchesof science such as nuclear physics and space exploration, the concept of ‘superfacilities’ is wellestablished. With the advent of whole genomesequencing, biology is moving into the superfacil-ity league and a number of sequencing ‘factories’have been established. Secondly, high throughputmethodologies have become commonplace and thishas meant a partnering of biology with automation,instrumentation and data management. Thirdly,many biologists have eschewed chemistry, physicsand mathematics but progress in genomics demandsthat biologists have a much greater understandingof these subjects. For example, methodologies suchas mass spectrometry, X-ray crystallography andprotein structure modelling are now fundamental tothe identification of gene function. The impact thatthis has on undergraduate recruitment in the sci-ences remains to be seen.

Outline of the rest of the book

The remainder of the book is divided into three parts.The first part concentrates on the methods devel-oped for mapping and sequencing genomes and onthe basics of bioinformatics (Fig. 1.4). The secondpart deals with genomics; i.e. the analysis of genome

data and the use of map and sequence data to locategenes of interest and to understand phenotypes andfundamental biological phenomena (Fig. 1.5). Thusin the second part of the book, we provide a solutionto the problem of understanding the phenotype asoutlined earlier in this chapter. Finally, in the lastpart (Chapter 12) we review some of the applicationsof the methodologies discussed in the preceding twosections.

The genomes of free-living cellular organismsrange in size from less than 1 Mb for some bacteria tomillions, or tens of millions, of megabases for someplants. It may even come as a surprise to some toknow that a protozoan, never mind a plant, canhave a larger genome than that of humans. How-ever, size does not necessarily equal gene content, aphenomenon first elaborated as the C-value paradox(p. 11). Rather, size is often a reflection of genomestructure and organization, particularly that ofrepetitive DNA. This topic is covered in Chapter 2.

The sheer size of the genome of even a simple bacterium is such that to handle it in the laboratorywe need to break it down into smaller pieces that arehandled as clones. The methods for doing this arecovered in Chapter 3. The process of putting thepieces back together again involves mapping. Manydifferent sequence markers are used to do this, aswell as some novel mapping methods, and these aredescribed in Chapter 4. DNA sequencing technologyis such that only short stretches (~600 bp) can beanalysed in a single reaction. Consequently, thegenome has to be fragmented and the sequence ofeach fragment determined and the total sequencereassembled (Chapter 5). Fortunately, the tools andtechniques used for mapping also can be applied togenome sequencing. Finally, all the sequence datagenerated need to be stored and the information thatthey contain extracted. An introduction to this topicof bioinformatics is provided in Chapter 6.

Sequencing a genome is not an end in itself.Rather, it is just the first stage in a long journeywhose goal is a detailed understanding of all the biological functions encoded in that genome andtheir evolution. To achieve this goal it is necessary todefine all the genes in the genome and the functionsthat they encode. There are a number of differentways of doing this and these are covered in the second part of the book (Chapters 7–11; Fig. 1.5).One such technique is comparative genomics (Chap-

··

POGC01 7/12/06 11:55 AM Page 6

Page 15: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

The new science of genomics 7

ter 7). The premise here is that DNA sequencesencoding important cellular functions are likely tobe conserved whereas dispensable or non-coding sequences will not. However, comparative genomicsonly gives a broad overview of the capabilities of different organisms. For a more detailed view oneneeds to identify each gene in the genome and itsfunction. Such whole genome annotation involves acombination of computer and experimental analysisand is described in Chapters 8–11.

In classical biochemistry one starts by purifying a protein of known function and then determiningits structure. In structural genomics, as described inChapter 8, one does the opposite on the basis thatpeptide sequences with a similar primary or second-ary structure are likely to have similar functions.Chapter 9 is devoted to the burgeoning field of

high throughput expression analysis which is beingused with great success to determine the function of anonymous genes. The methodologies used alsogive qualitative and quantitative information aboutgene expression as it relates to the biology of thewhole organism. For example, it is possible to iden-tify all the genes being transcribed in any cell or tis-sue at any time and the extent to which these RNAsare translated into proteins. Chapter 10 explores theidea of determining gene function by mutation.Whereas this is carried out on a gene-by-gene basisin classical genetics, in genomics it is performed on agenome-wide scale. Chapter 11 describes the invest-igation of protein–protein interactions and howthese interactions are being mapped and assembledinto databases in an attempt to link all proteins inthe cell into a functional network.

··

Chromosome

Genome

Library

Map

Sequence

Gene

Fragmentation with endonucleasesSeparation of large DNA fragmentsIsolation of chromosomesChromosome microdissectionVectors for cloning

Chapter 3

Genome sizeSequence complexityIntrons and exonsGenome structureRepetitive DNA

Chapter 2

Restriction fingerprintingSTSs, ESTs, SSLPs and SNPsRAPDs, CAPs and AFLPsHybridization mappingOptical mapping, radiation hybrids and HAPPY mappingIntegration of mapping methods

Chapter 4

Sequencing methodologyAutomation and high throughput sequencingSequencing strategiesSequencing large genomesPyrosequencingSequencing by hybridization

Chapter 5

Databases and softwareFinding genesIdentifying gene functionGenome annotationMolecular phylogenetics

Chapter 6

Fig. 1.4 ‘Road map’ outlining thedifferent methodologies used formapping and sequencing genomes.

POGC01 7/12/06 11:55 AM Page 7

Page 16: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

8 Chapter 1

Terminology

Workers in the field of genomics have coined awhole series of ‘-omics’ terms to describe sub-discip-lines of what they do. Confusingly, not everyoneuses these terms in the same way. Our use of theseterms is defined in Box 1.1.

Keeping up to date

The science of genomics is moving forward at anincredible pace and significant new advances arebeing reported weekly. This in turn has led to the publication of a plethora of new journals with ‘-omics’ in their titles and which many hard-pressedlibraries will be unable to afford. Fortunately, muchof this material can now be accessed through theInternet and in the chapters that follow reference is made to relevant websites whenever possible. Any reader not familiar with the PubMed website

(http://www.ncbi.nlm.nih.gov/PubMed/) is stronglyadvised to spend some time browsing it as it providesvery useful access to a wide range of literature. It alsohas links to the contents pages of many journals.

Suggested reading

Donis-Keller H. et al. (1987) A genetic linkage map of thehuman genome. Cell 51, 319–337. This is a classic paper anddescribes the first comprehensive human genetic map to be con-structed using DNA-based markers.

Fleischmann R.D. et al. (1995) Whole-genome randomsequencing and assembly of Haemophilus influenzae Rd.Science 269, 496–512. This is another classic paper and thewealth of information about the biology of the bacterium thatwas inferred from the sequence data provided the justification, ifone was needed, for whole genome sequencing.

Primrose S.B., Twyman R.M. & Old R.W. (2001) Principles ofGene Manipulation (6th edn.) Blackwell Science, Oxford. Thistextbook is widely used around the world and provides a detailedintroduction to the many different techniques of gene manipula-tion that form the basis of the methods for genome analysis.

··

Chapter 8Structuralgenomics

Chapter 7Comparative

genomics

Chapter 11Protein

interactions

Chapter 9Expressionprofiling

Chapter 10Mutantlibraries

Chapter 6Annotation andbioinformatics

Fig. 1.5 Organization of the secondpart of the book. Note that Chapter 6,which discusses bioinformatics and thestructural annotation of genomes, is thecentral underpinning theme. This linksgenome maps and sequences (Chapters1–5) with comparative genomics(Chapter 7) and functional genomics(Chapters 8–11). Chapters 8–11consider different aspects of functionalgenomics but, as shown by theoverlapping circles, none is a totallyisolated field. This figure conveys how aholistic approach is the best way tomine the genome for functionalinformation.

POGC01 7/12/06 11:55 AM Page 8

Page 17: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

The new science of genomics 9

Useful websites

http://www3.ncbi.nlm.nih.gov/This is the website of the National Center for BiotechnologyInformation. It contains links to many other useful websites.The OMIM pages on this site contain a wealth of informationon Mendelian inheritance in humans. This site also is the

entry point to PubMed which enables researchers to accessabstracts and journal articles on-line.

http://www.sciencemag.org/feature/plus/sfg/resources/This is the website on functional genomics resources hostedby Science magazine. It contains many useful pages and theones on model organisms are well worth visiting. There alsoare features pages which are revised regularly.

··

Term Definition

Genomics The study of the structure and function of the genome

Functional genomics The high throughput determination of the function of a gene product. Included within this definition is the expression of the gene, the relationship of the sequence and structure of the gene product to other gene products in the same or other organisms, and the molecular interactions of the gene product

Structural genomics The high throughput determination of structural motifs and complete protein structures and therelationship between these and function

Comparative genomics The use of sequence similarity and comparative gene order (synteny) to determine gene function andphylogeny

Proteomics The study of the proteome, i.e. the full complement of proteins made by a cell. The term includesprotein–protein and protein–small molecule interactions as well as expression profiling

Transcriptomics The study of the transcriptome, i.e. all the RNA molecules made by a cell, tissue or organism

Metabolomics The use of genome sequence analysis to determine the capability of a cell, tissue or organism to synthesize small molecules

Bioinformatics The branch of biology that deals with in silico processing and analysis of DNA, RNA and protein sequence data

Annotation The derivation of structural or functional information from unprocessed genomic DNA sequence data

Box 1.1 Genomics definitions used throughout this book

POGC01 7/12/06 11:55 AM Page 9

Page 18: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

··

CHAPTER 2

The organization and structure of genomes

Introduction

There is no such thing as a common genome struc-ture. Rather, there are major differences betweenthe genomes of bacteria, viruses and organelles onthe one hand and the nuclear genomes of eukary-otes on the other. Within the eukaryotes there aremajor differences in the types of sequences found,the amounts of DNA and the number of chromo-somes. This wide variability means that the map-ping and sequencing strategies involved depend onthe individual genome being studied.

Genome size

Because the different cells within a single organismcan be of different ploidy, e.g. germ cells are usuallyhaploid and somatic cells diploid, genome sizesalways relate to the haploid genome. The size of the

haploid genome also is known as the C-value.Measured C-values range from 3.5 × 103 bp for thesmallest viruses, e.g. coliphage MS2, to 1011 bp forsome amphibians and plants (Fig. 2.1). The largestviral genomes are 1–2 × 105 bp and are just a littlesmaller than the smallest cellular genomes, those ofsome mycoplasmas (5 × 105 bp). Simple unicellulareukaryotes have a genome size (1–2 × 107 bp) thatis not much larger than that of the largest bacterialgenomes. Primitive multicellular organisms such asnematodes have a genome size about four timeslarger. Not surprisingly, an examination of the gen-ome sizes of a wide range of organisms has shownthat the minimum C-value found in a particular phy-lum is related to the structural and organizationalcomplexity of the members of that phylum. Thus theminimum genome size is greater in organisms thatevolutionarily are more complex (Fig. 2.2).

A particularly interesting aspect of the data shownin Fig. 2.1 is the range of genome sizes found within

Flowering plants

Birds

Mammals

Reptiles

Amphibians

Bony fish

Cartilaginous fish

Echinoderms

Crustaceans

Insects

Molluscs

Worms

Fungi

Algae

Bacteria

Mycoplasmas

Viruses

103 104 105 106 107 108 109 1010 1011

DNA content (bp)

Fig. 2.1 The DNA content of thehaploid genome of a range of phyla. The range of values within a phylum isindicated by the shaded area. (Redrawnfrom Lewin 1994 by permission ofOxford University Press.)

POGC02 7/12/06 11:56 AM Page 10

Page 19: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Organization and structure of genomes 11

each phylum. Within some phyla, e.g. mammals,there is only a twofold difference between the largestand smallest C-value. Within others, e.g. insects andplants, there is a 10- to 100-fold variation in size. Isthere really a 100-fold variation in the number ofgenes needed to specify different flowering plants?Are some plants really more organizationally com-plex than humans, as these data imply? Althoughthere is evidence that birds with smaller genomesare better flyers (Hughes & Hughes 1995) and thatplants are more responsive to elevated carbon diox-ide concentrations ( Jasienski & Bazzaz 1995) astheir genomes increase in size, this is not sufficient toexplain the size differential. The resolution of thisapparent C-value paradox was provided by the an-alysis of sequence complexity by means of reasso-ciation kinetics.

Sequence complexity

When double-stranded DNA in solution is heated, it denatures (‘melts’) releasing the complementarysingle strands. If the solution is cooled quickly theDNA remains in a single-stranded state. However, if the solution is cooled slowly reassociation willoccur. The conditions for efficient reassociation ofDNA were determined originally by Marmur et al.(1963) and since then have been extensively studiedby others (for a review, see Tijssen 1993). The keyparameters are as follows. First, there must be anadequate concentration of cations and below 0.01 m

sodium ion there is effectively no reassociation.Secondly, the temperature of incubation must behigh enough to weaken intrastrand secondarystructure. In practice, the optimum temperature forreassociation is 25°C below the melting temperature(Tm), that is, the temperature required to dissociate50% of the duplex. Thirdly, the incubation time andthe DNA concentration must be sufficient to per-mit an adequate number of collisions so that theDNA can reassociate. Finally, the size of the DNAfragments also affects the rate of reassociation and isconveniently controlled if the DNA is sheared tosmall fragments.

The reassociation of a pair of complementarysequences results from their collision and thereforethe rate depends on their concentration. As twostrands are involved the process follows second-order kinetics. Thus, if C is the concentration of DNAthat is single stranded at time t, then

where k is the reassociation rate constant. If C0 is theinitial concentration of single-stranded DNA at timet = 0, integrating the above equation gives

When the reassociation is half complete, C/C0 = 0.5and the above equation simplifies to

Thus the greater the C0t1/2 value, the slower thereaction time at a given DNA concentration. Moreimportant, for a given DNA concentration the half-period for reassociation is proportional to thenumber of different types of fragments (sequences)present and thus to the genome size (Britten &Kohne 1968). This can best be seen from the data inTable 2.1. Because the rate of reassociation dependson the concentration of complementary sequences,the C0t1/2 for organism B will be 200 times greaterthan for organism A.

Experimentally it has been shown that the rate ofreassociation is indeed dependent on genome size(Fig. 2.3). However, this proportionality is only truein the absence of repeated sequences. When the

C t0 1 2

1/ .=

k

C

C C t0 0

11

.=+ ⋅k

dd

kC

tC = − 2

··

106 107 108 109 1010105

Mycoplasmas

Bacteria

Yeasts

Worms, insects

Birds, amphibians

Mammals

Minimum genome size (bp)

Fig. 2.2 The minimum genome size found in a range oforganisms. (Redrawn from Lewin 1994 by permission ofOxford University Press.)

POGC02 7/12/06 11:56 AM Page 11

Page 20: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

12 Chapter 2

reassociation of calf thymus DNA was first studied,kinetic analysis indicated the presence of two com-ponents (Fig. 2.4). About 40% of the DNA had aC0t1/2 of 0.03, whereas the remaining 60% had aC0t1/2 of 3000. Thus the concentration of DNAsequences that reassociate rapidly is 100 000 times,the concentration of those sequences that reasso-ciate slowly. If the slow fraction is made up of uniquesequences, each of which occurs only once in thecalf genome, then the sequences of the rapid frac-tion must be repeated 100 000 times, on average.Thus the C0t1/2 value can be used to determine thesequence complexity of a DNA preparation. A com-parative analysis of DNA from different sources hasshown that repetitive DNA occurs widely in eukary-otes (Davidson & Britten 1973) and that different

··

Organism A Organism B

Starting DNA concentration (C0) 10 pg ml−1 10 pg ml−1

Genome size 0.01 pg 2 pgNo. of copies of genome per ml 1000 5Relative concentration (A vs. B) 200 1

Table 2.1 Comparison of sequencecopy number for two organisms withdifferent genome sizes.

Perc

enta

ge r

eass

ocia

ted

0

100

20

40

60

C0t (mol × s l–1)

10

30

50

708090

Poly

U ·

Poly

A

Mo

use

sate

llite

DN

A

Phag

e M

S2

Phag

e T4

Cal

f th

ymus

DN

A(n

on

-rep

etit

ive)

1 1010109108107106105104103102101

Nucleotide pairs

10–6 10–5 10–4 10–3 10–2 10–1 1 101 102 103 104

E. c

oli

DN

A

Perc

enta

ge r

eass

ocia

ted

0

100

20

40

60

C0t (mol × s l–1)

10

30

50

70

80

90

10–2 10–1 1 101 102 103 104

Fig. 2.3 Reassociation of double-stranded nucleic acids from varioussources. (Redrawn from Lewin 1994 bypermission of Oxford University Press.)

Fig. 2.4 The kinetics of reassociation of calf thymus DNA.Compare the shape of the curve with those shown in Fig. 2.3.

POGC02 7/12/06 11:56 AM Page 12

Page 21: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Organization and structure of genomes 13

types of repeat are present. In the example shown in Fig. 2.5 a fast-renaturing and an intermediate-renaturing component can be recognized and arepresent in different copy numbers (500 000 and350, respectively) relative to the slow componentwhich is unique or non-repetitive DNA. The com-plexities of each of these components are 340 bp, 6 ×105 bp and 3 × 108 bp, respectively. The proportionof the genome that is occupied by non-repetitiveDNA versus repetitive DNA varies in different organ-isms (Fig. 2.6), thus resolving the C-value paradox.In general, the length of the non-repetitive DNAcomponent tends to increase as we go up the evolu-tionary tree to a maximum of 2 × 109 bp in mam-mals. The fact that many plants and animals have a much higher C-value is a reflection of the presenceof large amounts of repetitive DNA. Analysis of mes-senger RNA (mRNA) hybridization to DNA showsthat most of it anneals to non-repetitive DNA, i.e.most genes are present in non-repetitive DNA. Thusgenetic complexity is proportional to the content ofnon-repetitive DNA and not to genome size.

Introns and exons

Introns were initially discovered in the chicken ovalbumin and rabbit and mouse β-globin genes(Breatnach et al. 1977; Jeffreys & Flavell 1977).

Both these genes had been cloned by isolating themRNA from expressing cells and converting it tocomplementary DNA (cDNA). The next step was to use the cloned cDNA to investigate possible

··

Perc

enta

ge r

eass

ocia

ted

(1–

C/C

0)

100

0

C0t

75

50

20

Fastcomponent

25

0.0013

340

500000

Intermediatecomponent

30

1.9

6.0 × 105

350

Slowcomponent

45

630

3.0 × 108

1

Per cent ofgenome

C0t1/2

Complexity (bp)

Repetitionfrequency

10–4 10–3 10–2 10–1 1 101 102 103 104

Fig. 2.5 The reassociation kinetics of a eukaryotic DNAsample showing the presence of two types of repeated DNA. The arrows indicate the C0t1/2 values for the threecomponents. (Redrawn from Lewin 1994 by permission ofOxford University Press.)

Gen

ome

size

(bp

)

1010

109

108

107

106

E. coli C. elegans D. melanogaster X. laevis N. tabacumM. musculus

14% 13%

41%

65%

25% 30%

83%

70%

54%58%

3%

17%

5%

7%10%

100%

235

235

235

235

235

Non-repetitive

Highly repetitiveModerately repetitive

Fig. 2.6 The proportions of differentsequence components in representativeeukaryotic genomes. (Redrawn fromLewin 1994 by permission of OxfordUniversity Press.)

POGC02 7/12/06 11:56 AM Page 13

Page 22: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

14 Chapter 2

differences in the structure of the gene from express-ing and non-expressing cells. Here the Southern blothybridizations revealed a totally unanticipated situ-ation. It was expected that the analysis of genomicrestriction fragments generated by enzymes that didnot cut the cDNA would reveal only a single bandcorresponding to the entire gene. Instead severalbands were detected in the hybridized blots. The datacould be explained only by assuming the existence ofinterruptions in the middle of the protein-codingsequences. Furthermore, these insertions appearedto be present in both expressing and non-expressingcells. The gene insertions that are not translated intoprotein were termed introns and the sequences thatare translated were called exons.

Since the original discovery of introns, a largenumber of split genes has been identified in a widevariety of organisms. These introns are not restrictedto protein-coding genes for they have been found inrRNA and tRNA genes as well. Split genes are rare in

prokaryotes (Edgell et al. 2000; Martinez-Abarca &Toro 2000). They also are not particularly commonin lower eukaryotes (see below) but the mitochon-drial DNA of Euglena is an exception with 38% of the genome consisting of intron DNA (Hallick et al.1993).

In Saccharomyces cerevisiae, sequencing of the com-plete genome suggests that there are 235 intronscompared with over 6000 open-reading frames andthat introns account for less than 1% of the genome(Goffeau et al. 1996). Those genes which do haveintrons usually have only one small one and thelongest intron is only 1 kb in size.

However, proceeding up the evolutionary tree,the number of split genes, and the number and size ofintrons per gene, increases (Fig. 2.7 and Table 2.2).More important, genes that are related by evolutionhave exons of similar size, i.e. the introns are in the same position. However, the introns may vary in length, giving rise to variation in the length of

··

Per

cent

80

60

40

20

S. cerevisiae

Per

cent

30

20

10

D. melanogaster

Per

cent

15

5

Number of exons

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 <30 <40 <60 >601

Mammals

10

0

0

0

95%

17%

6%

Fig. 2.7 The number of exons in three representative eukaryotes. Uninterrupted genes have only one exon and are totalled in theleft-hand column. (Redrawn from Lewin 1994 by permission of Oxford University Press.)

POGC02 7/12/06 11:56 AM Page 14

Page 23: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Organization and structure of genomes 15

the genes (Fig. 2.8). Note also that introns are much longer than exons, particularly in highereukaryotes.

If a split gene has been cloned, it is possible to sub-clone either the exon or the intron sequences. If these sub-clones are used as probes in genomicSouthern blots, it is possible to determine if thesesame sequences are present elsewhere in thegenome. Often, the exon sequences of one gene arefound to be related to sequences in one or more othergenes. Some examples of such gene families are givenin Table 2.3. In some instances the duplicated genes

are clustered, whereas in others they are dispersed.Also, the members may have related, or even ident-ical, functions, although they may be expressed atdifferent times or in different cell types. Thus dif-ferent globin proteins are found in embryonic andadult red blood cells, while different actins are foundin muscle and non-muscle cells.

Functional divergence between members of amultigene family may extend to the loss of genefunction by some members. Such pseudogenes comein two types. In the first type they retain the usualintron and exon structure but are functionless or

··

Table 2.2 Intron statistics for genes from different species.

Average Average Average Average mRNA % Exon Species exon number intron number length (kb) length (kb) per gene

Yeast 1 0 1.6 1.6 100Nematode 4 3 4.0 3.0 75Fruit fly 4 3 11.3 2.7 24Chicken 9 8 13.9 2.4 17Mammals 7 6 16.6 2.2 13

0 50 100 150

Number of amino acids

122 334 159

22999119

4800 3400

141113

130 850

38 76 115

32 68 103

31 105

31 99

30 104

161

143

153

141

146

Plant globin

Leghaemoglobin

Myoglobin

Human β-globin

Human α-globin

1098

876

8659

677

1418

Size ofgene(bp)

Fig. 2.8 The placement of introns in different members of the globinsuperfamily. The size of the introns in base pairs is indicated inside theinverted triangles. Note that the size of each polypeptide and the location of the different introns are relativelyconsistent.

POGC02 7/12/06 11:56 AM Page 15

Page 24: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

16 Chapter 2

they lack one or more exons. In the second type,found in dispersed gene families, processed pseu-dogenes are found which lack any sequences cor-responding to the introns or promoters of the functional gene members. Multiple copies of an exonalso may be found because the same exons occur inseveral apparently unrelated genes. Exons that areshared by several genes are likely to encode poly-peptide regions that endow the disparate proteinswith related properties, e.g. adenosine triphosphate(ATP) or DNA binding. Some genes appear to bemosaics that were constructed by patching togethercopies of individual exons recruited from differentgenes, a phenomenon known as exon shuffling (see pp. 31 and 114).

By contrast with exons, introns are not related to other sequences in the genome, although theycontain the majority of dispersed, highly repetitivesequences. Thus, for some genes the exons con-stitute slightly repetitive sequences embedded in aunique context of introns. It should be noted thatintrons are not necessarily junk because there nowis evidence that some of them encode functionalRNA (Moore 1996).

Two intron databases have been constructed(Schisler & Palmer 2000). The Intron DataBase(IDB) contains detailed information about introns

and the other, the Intron Evolution DataBase, pro-vides a statistical analysis of the intron and exonsequences catalogued in the IDB.

Genome structure in viruses and prokaryotes

The genomes of viruses and prokaryotes are verysimple structures, although those of viruses showremarkable diversity (for a review see Dimmock et al.2001). Most viruses have a single linear or circulargenome but a few, such as reoviruses, bacteriophageφ6 and some plant viruses, have segmented RNAgenomes. For a long time it was believed that alleubacterial genomes consisted of a single circularchromosome. However, linear chromosomes havebeen found in Borrelia sp., Streptomyces sp. andRhodococcus fascians and mapping suggests thatCoxiella burnetii also has a linear genome. Two chro-mosomes have been found in a number of bacteriaincluding Rhodobacter spheroides, Brucella melitensis,Leptospira interrogans and Agrobacterium tumefaciens(Cole & Saint Girons 1994). In the case of Agrobac-terium, there is one circular chromosome and onenon-homologous linear chromosome (Goodner et al.1999). Linear plasmids have been found in Borrelia

··

Approximate Clustered (L) orGene family Organism no. of genes dispersed (D)

Actin Yeast 1 –Slime mould 17 L, DDrosophila 6 DChicken 8–10 DHuman 20–30 D

Tubulin Yeast 3 DTrypanosome 30 LSea urchin 15 L, DMammals 25 D

a-Amylase Mouse 3 LRat 9 ?Barley 7 ?

b-Globin Human 6 LLemur 4 LMouse 7 LChicken 4 L

Table 2.3 Some examples of multigenefamilies.

POGC02 7/12/06 11:56 AM Page 16

Page 25: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Organization and structure of genomes 17

sp. and Streptomyces sp. as well as a number of bacteria with circular chromosomes (Hinnebush &Tilley 1993). Borrelia has a very complex plasmidcontent with 12 linear molecules and nine circularmolecules (Casjens et al. 2000).

Bacterial genomes lack the centromeres found ineukaryotic chromosomes although there may be apartitioning system based on membrane adherence.Duplication of the genomes is initiated at an origin ofreplication and may proceed unidirectionally. Thestructure of the origin of replication, the oriC locus,has been extensively studied in a range of bacteriaand found to consist essentially of the same group ofgenes in a nearly identical order (Cole & Saint Girons1994). The oriC locus is defined as a region harbour-ing the dnaA (DNA initiation) or gyrB (B subunit of DNA gyrase) genes linked to a ribosomal RNAoperon.

Many bacterial and viral genomes are circular orcan adopt a circular conformation for the purposesof replication. However, those viral and bacterialgenomes which retain a linear configuration need aspecial mechanism to replicate the ends of the chro-mosome (see Box 2.1). A number of different strateg-ies for replicating the ends of linear molecules havebeen adopted by viruses (see Dimmock et al. 2001)but in bacteria there are two basic mechanisms(Volff & Altenbuchner 2000). In Borrelia, the chro-mosomes have covalently closed hairpin structures

at their termini. Such structures are also found inBorrelia plasmids, Escherichia coli phage N15, poxvir-uses and linear mitochondrial DNA molecules in theyeasts Williopsis and Pichia. Exactly how these hair-pin structures facilitate replication of the ends of themolecule is not known. By contrast, in Streptomyces,the linear molecules have proteins bound to the 5′ends of the DNA and such proteins are also found inadenoviruses, and a number of bacteriophages andfungal and plant mitochondrial plasmids. These ter-minal proteins probably are involved in the comple-tion of replication. In addition, Streptomyces linearreplicons have palindromic sequences and invertedrepeats at their termini.

The bacterial genomes that have been completelysequenced have sizes ranging from 0.6 to 7.6 Mb.The difference in size between the smallest and thelargest is not a result of introns for these are rare in prokaryotes (Edgell et al. 2000; Martinez-Abarca& Toro 2000). Nor is it a result of repeated DNA.Analysis of the kinetics of reassociation of denaturedbacterial DNA did not indicate the presence ofrepeated DNA in E. coli (Britten & Kohne 1968) andonly small amounts have been detected in all of thebacterial genomes that have been sequenced. Inboth Mycoplasma genitalium (0.58 Mb) and E. coli(4.6 Mb) about 90% of the genome is dedicated toprotein-coding genes. Therefore the differences insize reflect the number of genes carried. This begs the

··

The ends of eukaryotic chromosomes are also theends of linear duplex DNA and are known astelomeres. That these must have a special structurehas been known for a long time. For example, ifbreaks in DNA duplexes are not rapidly repaired byligation they undergo recombination or exonucleasedigestion, yet, the ends of chromosomes are stableand chromosomes are not ligated together. Also,DNA replication is initiated in a 5′→3′ direction with the aid of an RNA primer. After removal of thisprimer there is no way of completing the 5′ end of the molecule (Fig. B2.1). Thus, in the absence of amethod for completing the ends of the molecules,chromosomes would become shorter after each cell division.

Box 2.1 The need for telomeres

5’3’ 3’5’RNAprimer

5’ 3’Primer

excision

5’3’3’

5’

5’ 3’

3’

5’

5’

3’

3’

5’

5’

3’

Fig. B2.1 Formation of two daughter molecules withcomplementary single-stranded 3′ tails after primerexcision.

POGC02 7/12/06 11:56 AM Page 17

Page 26: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

18 Chapter 2

question as to the minimal genome size possible for a bacterium. The current view is about 300 genes(Mushegian 1999; see also p. 114). In Borrelia thesmall genome size (1.5 Mb) may be complementedby the high plasmid content (see above) which con-stitutes 0.67 Mb. These plasmids carry 535 genes,many of which have no counterparts in other organ-isms, suggesting that they perform specialized func-tions (Casjens et al. 2000).

The organization of organelle genomes

Mitochondria and chloroplasts both possess DNAgenomes that code for all of the RNA species andsome of the proteins involved in the functions of theorganelle. In some lower eukaryotes the mitochon-drial (mt) DNA is linear but more usually organellegenomes take the form of a single circular moleculeof DNA. Because each organelle contains severalcopies of the genome and because there are multipleorganelles per cell, organelle DNA constitutes arepetitive sequence. Whereas chloroplast (ct) DNAfalls in the range 120–200 kb, mtDNA varies enorm-ously in size. In animals it is relatively small, usu-ally less than 20 kb, but in plants it can be as big as2000 kb.

Organization of the chloroplast genome

The complete sequence of ctDNA has been reportedfor over a dozen plants including the single-celledprotist, Euglena (Hallick et al. 1993), a liverwort(Ohyama et al. 1986), and angiosperms such asArabidopsis, spinach, tobacco and rice (Shinozaki et al. 1986; Hiratsuka et al. 1989; Sato et al. 1999;

Schmitz-Linneweber et al. 2001). Overall, there is a remarkable similarity in size and organization (Fig. 2.9 and Table 2.4). The differences in size areaccounted for by differences in length of introns andintergenic regions and the number of genes. A gen-eral feature of ctDNA is a 10–24 kb sequence that ispresent in two identical copies as an inverted repeat.The proportion of the genome that is represented byintrons can be very high, e.g. in Euglena it is 38%.

The chloroplast genome encodes 70–90 proteins,including those involved in photosynthesis, fourrRNA genes and 30 or more tRNA genes. ChloroplastmRNAs are translated with the standard geneticcode (cf. mitochondrial mRNA). However, edit-ing events cause the primary structures of several

··

Invertedrepeats

Short unique sequence

Long unique sequence

Fig. 2.9 Generalized structure of ctDNA.

Feature Arabidopsis Spinach Maize

Inverted repeats 26 264 bp 25 073 bp 22 748 bpShort unique sequence 17 780 bp 17 860 bp 12 536 bpLong unique sequence 84 170 bp 82 719 bp 82 355 bpLength of total genome 154 478 bp 150 725 bp 140 387 bpNumber of genes 108 108 104rRNA genes 4 4 4tRNA genes 37 30 30Protein-encoding genes 87 74 70

Table 2.4 Key features of chloroplastDNA.

POGC02 7/12/06 11:56 AM Page 18

Page 27: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

Organization and structure of genomes 19

transcripts to deviate from the corresponding gen-omic sequences by C to U transitions with a strongbias for changes at the second codon position (Maieret al. 1995). This editing makes it difficult to convertchloroplast nucleotide sequences into amino acidsequences for the corresponding gene products.

Astasia longa is a colourless heterotrophic flagel-late which is closely related to Euglena gracilis. It con-tains a plastid DNA that is 73 kb in length, abouthalf the size of the ctDNA from Euglena. Sequencingof this plastid DNA has shown that all chloroplastgenes for photosynthesis-related proteins, exceptthat encoding the large subunit of ribulose-1, 5-bisphosphate carboxylase, are missing (Gockel &Hachtel 2000).

Organization of the mitochondrial genome

Mitochondrial DNA (mtDNA) is an essential compon-ent of all eukaryotic cells. It ensures consistency offunction (cellular respiration and oxidative phos-phorylation) despite the great diversity of genomeorganization. However, many of the genes for mito-chondrial proteins are found in the nucleus. Someorganisms use the standard genetic code to translatenuclear mRNAs and a different code for their mito-chondrial mRNAs.

The mtDNA from animals is about 15–17 kb insize, nearly always circular and very compact. Itencodes 37 genes: 13 for proteins, 22 for tRNAs andtwo for rRNAs. There are no introns and very littleintergenic space. It has been found that mtDNA cansurvive in museum specimens and palaeontologicalremains and so is proving useful for the study of thegenetic relationships of extinct animals (Hofreiteret al. 2001).

Plant mtDNA is much larger than that from ani-mals and in angiosperms ranges from 200 kb to 2 Mb (i.e. larger than some bacterial genomes).Plant mtDNAs rival the eukaryotic nucleus in termsof the C-value paradox they present. That is, largerplant mt genomes do not contain more genes thansmaller ones but simply have more spacer DNA.Plant mtDNAs do have introns but there is little vari-ation in intron content and size across the range ofangiosperms. As an example of the C-value paradox,Arabidopsis mtDNA is 367 kb in size whereas humanmtDNA is 16.6 kb in size but the former has only 14more genes (Marienfeld et al. 1999). Angiosperm

mtDNAs are larger than their animal counterpartspartly because of frequent duplications and partlybecause of the frequent capture of DNA sequencesfrom the chloroplast and the nucleus. Plant mtDNAsalso can lose sequences to the nucleus (Palmer et al.2000).

Fungal mtDNAs resemble plant mtDNAs in thatthey are larger than those from animals and moreheterogeneous in size, e.g. Saccharomyces mtDNA is86 kb in size whereas that from Mucor is 34 kb(Foury et al. 1998; Papp et al. 1999). Fungal mtDNAalso contains introns.

Plant mtDNAs differ from those of animals andlower eukaryotes in more than just size. At thesequence level they have an exceptionally low rate of point mutations that is 50–100 times lowerthan that seen in vertebrate mitochondria. At thestructural level, plant mtDNAs have a high rate ofgenomic rearrangement and duplication.

The organization of nuclear DNA in eukaryotes

Gross anatomy

Each eukaryotic nucleus encloses a fixed number of chromosomes which contain the nuclear DNA.During most of a cell’s life, its chromosomes exist in a highly extended linear form. Prior to cell division,however, they condense into much more compactbodies which can be examined microscopically afterstaining. The duplication of chromosomes occurschiefly when they are in the extended stage (inter-phase). One part of the chromosome, however,always duplicates during the contracted metaphasestate. This is the centromere, a body that controls themovement of the chromosome during mitosis. Itsstructure is discussed later (p. 30).

The ends of eukaryotic chromosomes are also theends of linear duplex DNA and are known as telo-meres. That these must have a special structure hasbeen known for a long time (see Box 2.1). In mosteukaryotes the telomere consists of a short repeat of TTAGGG many hundreds of units long but inSaccharomyces cerevisiae the repeat unit is TG1–3.These repeats vary considerably in length betweenspecies (Table 2.5) but each species maintains a fixedaverage telomere length in its germline. The enzyme

··

POGC02 7/12/06 11:56 AM Page 19

Page 28: POGA01 7/12/06 11:39 AM Page i Principles of Genome ... · genomics, 1 Introduction, 1 Physical mapping of genomes, 1 Sequencing whole genomes, 3 Benefits of genome sequencing, 5

20 Chapter 2

telomerase is responsible for maintaining the integ-rity of telomeres (see Box 2.2).

The fruit fly Drosophila differs from most othereukaryotes in having an unusually elaboratemethod for forming chromosome ends. Instead oftelomere repeats it has telomere-specific transpos-

able elements (Pardue et al. 1996) that resemblelong interspersed nuclear elements (LINEs) (see p. 28).

In many eukaryotes, a variety of treatments willcause chromosomes in dividing cells to appear as aseries of light- and dark-staining bands (Fig. 2.10).In G-banding, for example, the chromosomes aresubjected to controlled digestion with trypsin beforeGiemsa staining which reveals alternating posit-ively (dark G-bands) and negatively (R-bands or paleG-bands) staining regions. As many as 2000 lightand dark bands can be seen along some mammalianchromosomes. An identical banding pattern (Q-banding) can be seen if the Giemsa stain is replacedwith a fluorescent dye such as quinacrine whichintercalates between the bases of DNA. A structuralbasis for metaphase bands has been proposed that isbased on the differential size and packing of DNAloops and matrix-attachment sites in G- vs. R-bands

··

Table 2.5 Length of the telomere repeat in differenteukaryotic species.

Species Length of telomere repeat

S. cerevisiae (yeast) 300 bpMouse 50 kbHuman 10 kbArabidopsis 2–5 kbCereals 12–15 kbTobacco 60–160 kb

Telomeres are found at the ends of chromosomes andtheir role is to protect the ends of chromosomes. Theyalso stop chromosomes from fusing to each other.Telomeres consist of repeating units of TTAGGG thatcan be up to 15 000 bp in length. The very ends ofthe chromosomes are not blunt-ended but have 3′single-stranded overhangs of 12 or more nucleotides.The enzyme telomerase, also called telomere terminaltransferase, is a ribonucleoprotein enzyme whoseRNA component binds to the single-stranded end ofthe telomere. An associated reverse transcriptaseactivity is able to maintain the length and structureof telomeres by the mechanism shown in Fig. B2.2.For a detailed review of the mechanism of telomeremaintenance the reader should consult the paper byShore (2001).

Telomerase is found in fetal tissues, adult germcells and cancer cells. In normal somatic cells theactivity of telomerase is very low and each time thecells divide some of the telomere (25–200 bp) is lost. When the telomere becomes too short thechromosome no longer divides and the host cell dies by a process known as apoptosis. Thus, normal

Box 2.2 Telomerase, immortality and cancer

somatic cells are mortal and in tissue culture they willundergo 50–60 divisions (Hayflick limit) before theysenesce. In contrast to mammals, indeterminatelygrowing multicellular organisms, such as fish andcrustaceae, maintain unlimited growth potentialthroughout their entire life and retain telomeraseactivity (for a review see Krupp et al. 2000).

Cancer cells can divide indefinitely in tissue cultureand thus are immortal. Telomerase has been found incancer cells at activities 10- to 20-fold greater than innormal cells. This presence of telomerase confers aselective growth advantage on cancer cells and allowsthem to grow uncontrollably. Telomerase is an idealtarget for chemotherapy because this enzyme is activein most tumours but inactive in most normal cells.

If recombinant DNA technology is used to expresstelomerase in human somatic cells maintained in culture, senescence is avoided and the cellsbecome immortal. This immortalization usually isaccompanied by an increased expression of the c-myconcogene to the levels seen in many cancer cell lines.

Chromosomal rearrangements involving telomeresare emerging as an important cause of human

continued

POGC02 7/12/06 11:56 AM Page 20