The Dog Genome


Canis lupus familiaris, the dog, was domesticated by humans from the gray wolf thousands of years ago. While there are many kinds of wolves, they all look more or less the same. Not so with "man's best friend." The American Kennel Club recognizes about 155 different breeds. Dog breeds not only look different, they vary greatly in size. For example, an adult Chihuahua weighs just 1.5 kg, while a Scottish deerhound weighs 70 kg. No other mammal shows such large phenotypic variation, and biologists are curious about how this occurs. Also, there are hundreds of genetic diseases in dogs, and many of these diseases have counterparts in humans. To find out about the genes behind the phenotypic variation, and to elucidate the relationships between genes and diseases, the Dog Genome Project began in the late 1990s. Since then, the sequences of several dog genomes have been published.

Two dogs, a boxer and a poodle, were the first to have their entire genomes sequenced. The dog genome contains 2.8 billion base pairs of DNA in 39 pairs of chromosomes. There are 19,000 protein-coding genes, most of them with close counterparts in other mammals, including humans. The whole genome sequence made it easy to create a map of genetic markers: specific nucleotides or short sequences of DNA at particular locations on the genome that differ between individual dogs and/or breeds.

Genetic markers are used to map the locations of (and thus identify) genes that control particular traits. For example, Dr. Elaine Ostrander and her colleagues at the National Institutes of Health studied Portuguese water dogs to identify genes that control size. Taking samples for DNA isolation was relatively easy: a cotton swab was swept over the inside of the cheek. As Dr. Ostrander said, the dogs didn't care, especially if they thought they were going to get a treat or if there was a tennis ball in the other hand. It turned out that the gene for insulin-like growth factor 1 (IGF-1) is important in determining size: large breeds have an allele that codes for an active IGF-1, and small breeds have a different allele that codes for a less active IGF-1.

Another gene important to phenotypic variation was found in whippets, sleek dogs that run fast and are often raced. A mutation in the gene for myostatin, a protein that inhibits overdevelopment of muscles, results in a


Variation in Dogs The Chihuahua (bottom) and the mastiff (top) are the same species, Canis lupus familiaris, yet show great variation in size. Genome sequencing has revealed insights into how size is controlled by genes.

PART FIVE GENOMES

This material cannot be copied, reproduced, manufactured, or disseminated in any form without express written permission from the publisher.

© 2010 Sinauer Associates, Inc.



17.1 How Are Genomes Sequenced?

As you saw in the opening story on dogs, one reason for sequencing genomes is to compare different organisms. Another is to identify changes in the genome that result in disease. In 1986, the Nobel laureate Renato Dulbecco and others proposed that the world scientific community be mobilized to undertake the sequencing of the entire human genome. One challenge discussed at the time was to detect DNA damage in people who had survived the atomic bomb attacks and been exposed to radiation in Japan during World War II. But in order to detect changes in the human genome, scientists first needed to know its normal sequence.

The result was the publicly funded Human Genome Project, an enormous undertaking that was successfully completed in 2003. This effort was aided and complemented by privately funded groups. The project benefited from the development of many new methods that were first used in the sequencing of smaller genomes, those of prokaryotes and simple eukaryotes.

Two approaches were used to sequence the human genome

Many prokaryotes have a single chromosome, while eukaryotes have several to many. Because of their differing sizes, chromosomes can be separated from one another, identified, and experimentally manipulated. It might seem that the most straightforward approach to sequencing a chromosome would be to start at one end and simply sequence the entire DNA molecule. However, this approach is not practical, since only about 700 base pairs can be sequenced at a time using current methods. Prokaryotic chromosomes contain 1–4 million base pairs, and human chromosome 1 contains 246 million base pairs.

To sequence an entire genome, chromosomal DNA must be cut into short fragments about 500 base pairs long, which are separated and sequenced. For the haploid human genome, which has about 3.3 billion base pairs, there are more than 6 million such fragments. When all of the fragments have been sequenced, the problem becomes how to put these millions of sequences together. This task can be accomplished using larger, overlapping fragments.

Let's illustrate this process using a single, 10 base-pair DNA molecule. (This is a double-stranded molecule, but for convenience we show only the sequence of the noncoding

IN THIS CHAPTER we look at genomes. First we look at how large molecules of DNA are cut and sequenced, and what kinds of information these genome sequences provide. Then we turn to the results of ongoing sequencing efforts in both prokaryotes and eukaryotes. We next consider the human genome and some of the real and potential uses of human genome information. Finally, we will describe the emerging fields of proteomics and metabolomics, which attempt to give a complete inventory of a cell's proteins and metabolic activity.

whippet that is more muscular and runs faster. Myostatin is important in human muscles as well.

Inevitably, some scientists have set up companies to test dogs for genetic variations, using DNA supplied by anxious owners and breeders. Some traditional breeders frown on this practice, but others say it will improve the breeds and give more joy (and prestige) to owners. So the issues surrounding the Dog Genome Project are not very different from ones arising from the Human Genome Project.

Powerful methods have been developed to analyze DNA sequences, and the resulting information is accumulating at a rapid rate. Comparisons of sequenced genomes are providing new insights into evolutionary relationships and confirming old ones. We are in a new era of biology.

CHAPTER OUTLINE

17.1 How Are Genomes Sequenced?

17.2 What Have We Learned from Sequencing Prokaryotic Genomes?

17.3 What Have We Learned from Sequencing Eukaryotic Genomes?

17.4 What Are the Characteristics of the Human Genome?

17.5 What Do the New Disciplines of Proteomics and Metabolomics Reveal?

Genetic Bully These dogs are both whippets, but the muscle-bound dog (right) has a mutation in a gene that limits muscle buildup.


strand.) The molecule is cut three ways. The first cut generates the fragments:

TG, ATG, and CCTAC

The second cut of the same molecule generates the fragments:

AT, GCC, and TACTG

The third cut results in:

CTG, CTA, and ATGC

Can you put the fragments into the correct order? (The answer is ATGCCTACTG.) Of course, the problem of ordering 6 million fragments, each about 500 bp long, is more of a challenge! The field of bioinformatics was developed to analyze DNA sequences using complex mathematics and computer programs.
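The puzzle can also be checked by brute force. The following is a minimal sketch, not how real assemblers work: it tries every ordering of the fragments from each cut and keeps only the sequences that all three cuts can produce. The function names are illustrative, and the approach is practical only for a handful of fragments.

```python
from itertools import permutations

# The three digests of the same 10-bp molecule, as given in the text.
DIGESTS = [
    ["TG", "ATG", "CCTAC"],
    ["AT", "GCC", "TACTG"],
    ["CTG", "CTA", "ATGC"],
]

def candidate_orders(fragments):
    """Return every sequence obtainable by joining the fragments in some order."""
    return {"".join(order) for order in permutations(fragments)}

def assemble(digests):
    """Keep only the sequences that every digest can produce."""
    candidates = candidate_orders(digests[0])
    for fragments in digests[1:]:
        candidates &= candidate_orders(fragments)
    return candidates

print(assemble(DIGESTS))  # {'ATGCCTACTG'}
```

Each cut alone allows six possible orders; only one order is consistent with all three. With millions of fragments the number of orderings is far too large to enumerate, which is why overlap-based alignment by computer is needed.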

Until recently, two broad approaches were used to analyze DNA fragments for alignment: hierarchical sequencing and shotgun sequencing. These were developed for the Human Genome Project, but have been applied to other organisms as well.

HIERARCHICAL SEQUENCING The publicly funded human genome sequencing team developed a method known as hierarchical sequencing. The first step was to systematically identify short marker sequences along the chromosomes, ensuring that every fragment of DNA to be sequenced would contain a marker (Figure 17.1A). Genetic markers can be short tandem repeats (STRs), single nucleotide polymorphisms (SNPs), or the recognition sites for restriction enzymes, which recognize and cut DNA at specific sequences (see Chapter 15).

Some restriction enzymes recognize sequences of 4 base pairs and generate many fragments from a large DNA molecule. For example, the enzyme Sau3A cuts DNA every time it encounters GATC. Other restriction enzymes recognize sequences of 8–12 base pairs (NotI cuts at GCGGCCGC, for example) and generate far fewer, but much larger, fragments.
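Why does a longer recognition sequence give fewer, larger fragments? A rough sketch follows. It assumes, purely for illustration, that DNA is a random sequence with all four bases equally frequent, so a site of length n is expected about once every 4^n base pairs. The toy sequence and function names are made up for this example.

```python
def find_sites(dna, site):
    """Return the 0-based start position of every occurrence of site in dna."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions

def expected_spacing(site):
    """Average distance between sites in random DNA with equal base
    frequencies (a simplifying assumption, not a property of real genomes)."""
    return 4 ** len(site)

toy_dna = "AAGATCTTGATCGCGGCCGCAA"  # made-up sequence for illustration
print(find_sites(toy_dna, "GATC"))      # [2, 8]
print(find_sites(toy_dna, "GCGGCCGC"))  # [12]
print(expected_spacing("GATC"), expected_spacing("GCGGCCGC"))  # 256 65536
```

Under this assumption a 4-bp site such as GATC recurs every few hundred base pairs, whereas an 8-bp site such as GCGGCCGC recurs only every several tens of thousands.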

In hierarchical sequencing, genomic DNA is cut up into a set of relatively large (55,000 to 2 million bp) fragments. If different enzymes are used in separate digests, the fragments will overlap so that some fragments share particular markers. Each fragment is inserted into a bacterial plasmid to create a bacterial artificial chromosome (BAC), which is then inserted into bacteria. Each bacterium gets just one plasmid with its fragment of (for example) the human genome and is allowed to grow into a colony containing millions of genetically identical bacteria (called a clone). Clones differ from one another in that each has a different fragment from the human genome. A collection of clones, containing many different fragments of a genome, is called a genomic library.

The DNA from each clone is then extracted and cut into smaller overlapping pieces, which in turn are cloned, purified, and sequenced. The overlapping parts of the sequences allow researchers (with the aid of computers) to align them to create the complete sequence of the BAC clone. The genetic markers on each BAC clone are used to arrange the larger fragments in the proper order along the chromosome map. This method works, but it is slow. An alternative approach, shotgun sequencing, makes greater use of computers to align the sequences.




TOOLS FOR INVESTIGATING LIFE

17.1 Sequencing Genomes Involves Fragment Overlaps Short fragments of the whole genome can be sequenced, but then the fragments must be correctly aligned. Historically two approaches were used. Both involved the use of bacterial clones to separate and amplify individual DNA fragments.

(A) Hierarchical sequencing

1 A marker map is made on a large DNA.

2 The DNA is cut into fragments of 55,000 to 2 million bp. Several cuts are made to create overlapping fragments.

3 Each fragment is amplified in a bacterial artificial chromosome (BAC).

4 Marker sequences are identified on the fragments; common ones indicate overlap.

5 The BAC fragments are cut into small pieces and sequenced from marker to marker, 500 bp at a time.

(B) Shotgun sequencing

1 DNA is randomly broken into 500 bp fragments. Several cuts are made to create overlapping fragments.

2 Each fragment is amplified and then sequenced.

3 A computer finds sequences shared by fragments (overlaps) and aligns the fragments.


SHOTGUN SEQUENCING Instead of mapping the genome and creating a BAC library, the shotgun sequencing method involves directly cutting genomic DNA into smaller, overlapping fragments that are cloned and sequenced. Powerful computers align the fragments by finding sequence homologies in the overlapping regions (Figure 17.1B). As sequencing technologies and computers have improved, the shotgun approach has become much faster and cheaper than the hierarchical approach.

As a demonstration, researchers used this method to sequence a prokaryotic genome of more than a million base pairs in just a few months. Next came larger genomes. The entire 180 million-base-pair fruit fly genome was sequenced by the shotgun method in little over a year. This success proved that the shotgun method might work for the much larger human genome, and in fact it was used to sequence the human genome rapidly relative to the hierarchical method.

The nucleotide sequence of DNA can be determined

How are the individual DNA fragments generated by the hierarchical or shotgun methods sequenced? Current techniques are variations of a method developed in the late 1970s by Frederick Sanger. This method uses chemically modified nucleotides that were originally developed to stop cell division in cancer. As we discuss in Chapter 13, deoxyribonucleoside triphosphates (dNTPs) are the normal substrates for DNA replication, and contain the sugar deoxyribose. If that sugar is replaced with 2′,3′-dideoxyribose, the resulting dideoxyribonucleoside triphosphate (ddNTP) will still be added by DNA polymerase to a growing polynucleotide chain. However, because the ddNTP has no hydroxyl group (-OH) at the 3′ position, the next nucleotide cannot be added (Figure 17.2A). Thus synthesis stops at the position where ddNTP has been incorporated into the growing end of a DNA strand.

TOOLS FOR INVESTIGATING LIFE

17.2 Sequencing DNA (A) The normal substrates for DNA replication are dNTPs. The chemically modified structure of ddNTPs causes DNA synthesis to stop. (B) When labeled ddNTPs are incorporated into a reaction mixture for replicating a DNA template of unknown sequence, the result is a collection of fragments of varying lengths that can be separated by electrophoresis.

(A) Deoxyribonucleoside triphosphate (dNTP) (normal) and dideoxyribonucleoside triphosphate (ddNTP) (chemically modified). Absence of OH at the 3′ position means that additional nucleotides cannot be added.

(B)

1 The DNA fragment for which the base sequence is to be determined is isolated and serves as the template.

2 Each of the ddNTPs (ddCTP, ddGTP, ddTTP, ddATP) is bound to a fluorescent dye.

3 A sample of the unknown DNA is combined with primer (sequence known), DNA polymerase, dNTPs, and the fluorescent ddNTPs. Synthesis begins.

4 The results are illustrated here by what binds to a T in the unknown template. If ddATP is added, synthesis stops. A series of fragments of different lengths is made, each ending with a ddNTP.

5 The newly synthesized fragments of various lengths are separated by electrophoresis, from the shortest fragment to the longest.

6 Each fragment fluoresces a color that identifies the ddNTP that terminated the fragment. The color at the end of each fragment is detected by a laser beam.

7 The sequence of the DNA can now be deduced from the colors of each fragment...

8 ...and converted to the sequence of the template strand.

Example sequences shown in the figure: template strand TTAGACCCGATAAGCCCGCA; newly synthesized strand AATCTGGGCTATTCGGGCGT.

To determine the sequence of a DNA fragment (usually no more than 700 base pairs long), it is isolated and mixed with

DNA polymerase

A short primer appropriate for the DNA sequence

The four dNTPs (dATP, dGTP, dCTP, and dTTP)

Small amounts of the four ddNTPs, each bonded to a differently colored fluorescent tag

In the first step of the reaction, the DNA is heated to denature it (separate it into single strands). Only one of these strands will act as a template for sequencing: the one to which the primer binds. DNA replication proceeds, and the test tube soon contains a mixture of the original DNA strands and shorter, new complementary strands. The new strands, each ending with a fluorescent ddNTP, are of varying lengths. For example, each time a T is reached on the template strand, DNA polymerase adds either a dATP or a ddATP to the growing complementary strand. If dATP is added, the strand continues to grow. If ddATP is added, growth stops (Figure 17.2B).

After DNA replication has been allowed to proceed for a while, the new DNA fragments are denatured and the single-stranded fragments separated by electrophoresis (see Figure 15.8), which sorts the DNA fragments by length. During the electrophoresis run, the fragments pass through a laser beam that excites the fluorescent tags, and the distinctive color of light emitted by each ddNTP is detected. The color indicates which ddNTP is at the end of each strand. A computer processes this information and prints out the DNA sequence of the fragment (see Figure 17.2B).
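The logic of reading a sequence from terminated fragments can be sketched in a few lines. This is a simplified model, not the software that sequencing machines run: it assumes that a fragment terminated at every position is present, ignores strand polarity (5′ to 3′ direction), and writes the template in the order in which the polymerase reads it. The template is the example sequence from Figure 17.2; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def terminated_fragments(template):
    """Model the reaction products: one new strand stopped at each position.
    The last base of each fragment stands for the fluorescent ddNTP."""
    new_strand = "".join(COMPLEMENT[base] for base in template)
    return [new_strand[:length] for length in range(1, len(new_strand) + 1)]

def read_sequence(fragments):
    """Sort fragments by length (electrophoresis) and read the terminal
    base of each one (the color seen by the detector)."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))

def template_from_read(read):
    """Convert the sequence of the new strand back to the template sequence."""
    return "".join(COMPLEMENT[base] for base in read)

template = "TTAGACCCGATAAGCCCGCA"
fragments = list(reversed(terminated_fragments(template)))  # arbitrary order
read = read_sequence(fragments)
print(read)                      # AATCTGGGCTATTCGGGCGT
print(template_from_read(read))  # TTAGACCCGATAAGCCCGCA
```

The key point is that fragment length gives the position and the terminal label gives the base, so sorting by length recovers the order of bases.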

The delivery of chemical reagents by automated machines, coupled with automated analysis, has made DNA sequencing faster than ever. Huge laboratories often have 80 sequencing machines operating at once, each of which can sequence and analyze up to 70,000 bp in a typical 4-hour run. This may be fast enough for a prokaryotic genome with 1.5 million base pairs (roughly 20 runs), but when it comes to routine sequencing of larger genomes (like the 3.3 billion-base-pair human genome), even more speed is needed.
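The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted in the paragraph above and assumes, as a simplification, that every base is sequenced exactly once; real projects sequence each region several times to obtain overlaps, so actual times would be longer.

```python
BP_PER_RUN = 70_000    # bases per machine per run (from the text)
HOURS_PER_RUN = 4      # length of a typical run (from the text)
MACHINES = 80          # machines operating at once (from the text)

def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def runs_needed(genome_bp):
    """Machine runs needed to cover the genome once (single-pass assumption)."""
    return ceil_div(genome_bp, BP_PER_RUN)

def lab_hours(genome_bp):
    """Elapsed hours if all machines run in parallel."""
    rounds = ceil_div(runs_needed(genome_bp), MACHINES)
    return rounds * HOURS_PER_RUN

print(runs_needed(1_500_000))      # 22
print(runs_needed(3_300_000_000))  # 47143
print(lab_hours(3_300_000_000))    # 2360
```

So 1.5 million base pairs is about 21.4 runs (22 whole runs), in line with the rounded figure in the text, while a single pass over the human genome would occupy 80 machines for about 2,360 hours, or more than three months.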

High-throughput sequencing has been developed for large genomes

The first decade of the new millennium has seen rapid development of high-throughput sequencing methods: fast, cheap ways to sequence and analyze large genomes. A variety of different approaches are being used. They generally involve the amplification of DNA templates by the polymerase chain reaction (PCR; see Section 13.5), and the physical binding of template DNA to a solid surface or to tiny beads called microbeads. These techniques are often referred to as massively parallel DNA sequencing, because thousands or millions of sequencing reactions are run at once to greatly speed up the process. One such high-throughput method is illustrated in Figure 17.3. In one run, these machines can sequence 50,000,000 base pairs of DNA. How does it work?

TOOLS FOR INVESTIGATING LIFE

17.3 High-Throughput Sequencing High-speed sequencing is faster and cheaper than traditional methods and involves the chemical amplification of DNA fragments. One example of high-throughput sequencing is shown here.

1 A large DNA is cut into fragments of 300–800 bp and denatured to single strands.

2 Each single-stranded DNA fragment is attached to a microbead.

3 PCR amplifies each fragment to 2 million copies per bead.

4 Each bead is put into a microwell on a plate.

5 DNA sequencing is done one base at a time, read by a scanner.

6 The sequence is analyzed by computer.

For massively parallel sequencing using microbeads, the genomic DNA is first cut into 300- to 800-base-pair fragments. The fragments are denatured to single strands and attached to tiny beads that are less than 20 μm in diameter, one DNA fragment (template) per bead. PCR is used to create several million identical copies of the fragment on each bead. Then each bead is loaded into a tiny (40 μm diameter) well in a multi-well plate, and the sequencing begins.

The automated sequencer adds a reaction mix like the one described above, but containing only one of four fluorescently labeled dNTPs. That nucleotide will become incorporated as the first nucleotide in a complementary strand only in wells where the first nucleotide in the template strand can base-pair with it. For example, if the first nucleotide on the template in well #1 has base T, then a fluorescent nucleotide with base A will bind to that well. Next, the reaction mix is removed and a scanner captures an image of the plate, indicating which wells contain the fluorescent nucleotide. This process is repeated with a different labeled nucleotide. The machine cycles through many repeats using all four dNTPs, and records which wells gain new nucleotides after each cycle. A computer then identifies the sequence of nucleotides that were gained by each well, and aligns the fragments to provide the complete sequence of the genome.
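The cycling procedure can be modeled as a small simulation. This is a sketch under simplifying assumptions, not the instrument's actual software: each well holds one template, nucleotides are offered in the fixed order A, C, G, T (an arbitrary choice), and a well gains at most one nucleotide each time a nucleotide is offered. The templates and names are invented for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
FLOW_ORDER = "ACGT"  # order in which labeled dNTPs are offered (assumed)

def sequence_wells(templates, max_cycles=100):
    """Return the complementary strand built up in each well."""
    position = [0] * len(templates)  # next unread template base per well
    reads = [""] * len(templates)
    for _ in range(max_cycles):
        for nucleotide in FLOW_ORDER:
            # The "scanner image": a well gains the nucleotide only if
            # it pairs with that well's next template base.
            for well, template in enumerate(templates):
                i = position[well]
                if i < len(template) and COMPLEMENT[template[i]] == nucleotide:
                    reads[well] += nucleotide
                    position[well] += 1
        if all(p == len(t) for p, t in zip(position, templates)):
            break
    return reads

wells = ["TAGC", "GGCA"]  # made-up templates, one per well
print(sequence_wells(wells))  # ['ATCG', 'CCGT']
```

Because every well is examined after every flow, the same loop serves two wells or two million, which is the sense in which the method is massively parallel.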

This method was used to sequence the genome of James Watson, codiscoverer of the DNA double helix. It took less than two months and cost less than $1 million. Sequencing methods are being continually refined to increase speed and accuracy and decrease costs.

Genome sequences yield several kinds of information

New genome sequences are published more and more frequently, creating a torrent of biological information (Figure 17.4). In general, biologists use sequence information to identify:

Open reading frames, the coding regions of genes. For protein-coding genes, these regions can be recognized by the start and stop codons for translation, and by intron consensus sequences that indicate the locations of introns.

Amino acid sequences of proteins, which can be deduced from the DNA sequences of open reading frames by applying the genetic code (see Figure 14.6).

Regulatory sequences, such as promoters and terminators for transcription.

RNA genes, including rRNA, tRNA, and small nuclear RNA (snRNA) genes.

Other noncoding sequences that can be classified into various categories including centromeric and telomeric regions, nuclear matrix attachment regions, transposons, and repetitive sequences such as short tandem repeats.

Sequence information is also used for comparative genomics, the comparison of a newly sequenced genome (or parts thereof) with sequences from other organisms. This can give information about the functions of sequences, and can be used to trace evolutionary relationships among different organisms.
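Recognizing an open reading frame by its start and stop codons, and deducing the amino acid sequence from it, can be illustrated as follows. This is a minimal sketch that assumes an intron-free sequence (as in prokaryotes), searches only the strand given, and uses the standard genetic code written in one-letter amino acid symbols. The example sequence and function names are made up.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; "*" = stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna):
    """Return (start, sequence) for each ATG...stop stretch in the three
    reading frames of the given strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, dna[start:i + 3]))
                start = None
    return orfs

def translate(orf):
    """Apply the genetic code codon by codon, stopping at a stop codon."""
    protein = ""
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein

orfs = find_orfs("CCATGGCTTGGTAAGG")
print(orfs)                   # [(2, 'ATGGCTTGGTAA')]
print(translate(orfs[0][1]))  # MAW
```

Real gene-finding programs must also handle the complementary strand, introns, and very short chance ORFs, none of which this sketch attempts.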




17.4 The Genomic Book of Life Genome sequences contain many features, some of which are summarized in this overview. Sifting through all the information contained in a genome sequence can help us understand how an organism functions and what its evolutionary history might be.

A chromosome has a single DNA molecule with specialized DNA sequences for the initiation of DNA replication, for spindle interactions in mitosis (centromeres), and for maintaining the integrity of the ends (telomeres). See Chapters 11 and 12.

Chromatin remodeling alters genome packaging and therefore gene expression. See Chapter 16.

Large chromosomes contain multiple origins for DNA replication. See Chapter 13.

Figure labels: centromere sequences, histones, DNA replication machinery, telomere sequences, chromosome.

17.1 RECAP

The sequencing of genomes required the development of ways to cut large chromosomes into fragments, sequence the fragments, and then line them up on the chromosome. Two ways to do this are hierarchical sequencing and shotgun sequencing. Today new procedures are being developed that require automation and powerful computers. Actual DNA sequencing involves labeled nucleotides that are detected at the ends of growing polynucleotide chains.

What are the hierarchical and shotgun approaches to genome analysis? See pp. 367–368 and Figure 17.1

What is the dideoxy method for DNA sequencing? See pp. 368–369 and Figure 17.2

Explain how high-throughput sequencing methods work. See pp. 369–370 and Figure 17.3

How are open reading frames recognized in a genomic sequence? What kind of information can be derived from an open reading frame? See p. 370

We now turn to the first organisms whose sequences were determined, prokaryotes, and the information these sequences provided.

17.2 What Have We Learned from Sequencing Prokaryotic Genomes?

When DNA sequencing became possible in the late 1970s, the first life forms to be sequenced were the simplest viruses with their relatively small genomes. The sequences quickly provided new information on how these viruses infect their hosts and reproduce. But the manual sequencing techniques used on viruses were not up to the task of studying the genomes of prokaryotes and eukaryotes. The newer, automated sequencing techniques we just described made such studies possible. We now have genome sequences for many prokaryotes, to the great benefit of microbiology and medicine.

The sequencing of prokaryotic genomes led to new genomics disciplines

In 1995 a team led by Craig Venter and Hamilton Smith determined the first complete genomic sequence of a free-living cellular organism, the bacterium Haemophilus influenzae. Many more prokaryotic sequences have followed, revealing not only how prokaryotes apportion their genes to perform different cellular functions, but also how their specialized functions are carried out. Soon we may even be able to ask the provocative question of what the minimal requirements of a living cell might be.

FUNCTIONAL GENOMICS Functional genomics is the biological discipline that assigns functions to the products of genes. This

(Figure 17.4, continued)

Gene expression occurs at open reading frames, from which RNA polymerase transcribes mRNAs that are translated to form proteins. Genes contain DNA sequences for control of their expression. See Chapters 14–16.

RNA genes encode RNAs that are not translated into proteins. These RNAs include rRNA and tRNA, which are part of the protein translation machinery (Chapter 14), and miRNAs involved in control of gene expression (Chapter 16).

Noncoding sequences on the genome include highly repetitive sequences and transposons. See Chapters 15 and 17.

Figure labels: RNA polymerase, RNA genes, noncoding sequences, tRNA, promoter of transcription, terminator of transcription, mRNA, open reading frame (protein coding sequence), epigenetic modification of gene: methylation.

field, less than 15 years old, is now a major occupation of biologists. Let's see how functional genomics methods were applied to the bacterium H. influenzae once its sequence was known.

The only host for H. influenzae is humans. It lives in the upper respiratory tract and can cause ear infections or, more seriously, meningitis in children. Its single circular chromosome has 1,830,138 base pairs (Figure 17.5). In addition to its origin of replication and the genes coding for rRNAs and tRNAs, this bacterial chromosome has 1,738 open reading frames with promoters nearby.

When this sequence was first announced, only 1,007 (58 percent) of the open reading frames coded for proteins with known functions. The remaining 42 percent coded for proteins whose functions were unknown. Since then scientists have identified many of these proteins' roles. For example, they found genes for enzymes of glycolysis, fermentation, and electron transport. Other gene sequences code for membrane proteins, including those involved in active transport. An important finding was that highly infective strains of H. influenzae, but not noninfective strains, have genes for surface proteins that attach the bacterium to the human respiratory tract. These surface proteins are now a focus of research on possible treatments for H. influenzae infections.

COMPARATIVE GENOMICS Soon after the sequence of H. influenzae was announced, smaller (Mycoplasma genitalium; 580,073 base pairs) and larger (E. coli; 4,639,221 base pairs) prokaryotic sequences were completed. Thus began a new era in biology, that of comparative genomics, which compares genome sequences from different organisms. Scientists can identify genes that are present in one bacterium and missing in another, allowing them to relate these genes to bacterial function.

M. genitalium, for example, lacks the enzymes needed to synthesize amino acids, which E. coli and H. influenzae both possess. This finding reveals that M. genitalium must obtain all its amino acids from its environment (usually the human urogenital tract). Furthermore, E. coli has 55 regulatory genes coding for transcriptional activators and 58 for repressors; M. genitalium only has 3 genes for activators. What do such findings tell us about an organism's lifestyle? For example, is the biochemical flexibility of M. genitalium limited by its relative lack of control over gene expression?
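At its simplest, this kind of comparison is a set operation. The sketch below is only an illustration: the inventories are deliberately incomplete, listing just the functional categories mentioned in this section rather than real gene lists, and the function name is hypothetical.

```python
# Incomplete, illustrative inventories: only categories named in the text.
genomes = {
    "E. coli": {"amino acid synthesis", "transcriptional activators",
                "transcriptional repressors"},
    "H. influenzae": {"amino acid synthesis", "glycolysis"},
    "M. genitalium": {"transcriptional activators"},
}

def missing_from(target, reference):
    """Categories present in the reference genome but absent from the target."""
    return genomes[reference] - genomes[target]

print(sorted(missing_from("M. genitalium", "E. coli")))
# ['amino acid synthesis', 'transcriptional repressors']

# Present in both E. coli and H. influenzae, but missing in M. genitalium:
shared = genomes["E. coli"] & genomes["H. influenzae"]
print(shared - genomes["M. genitalium"])  # {'amino acid synthesis'}
```

In practice genes are matched between genomes by sequence similarity rather than by name, but the underlying question (what is present here and missing there?) is the same.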

Some sequences of DNA can move about the genome

Genome sequencing allowed scientists to study more broadly a class of DNA sequences that had been discovered by geneticists decades earlier. Segments of DNA called transposable elements can move from place to place in the genome and can even be inserted into another piece of DNA in the same cell (e.g., a plasmid). A transposable element might be at one location in the genome of one E. coli strain, and at a different location in another strain. The insertion of this movable DNA sequence from elsewhere in the genome into the middle of a protein-coding gene disrupts that gene (Figure 17.6A). Any mRNA expressed from the disrupted gene will have the extra sequence and the




17.5 Functional Organization of the Genome of H. influenzae The entire DNA sequence has 1,830,137 base pairs. Different colors reflect different classes of gene function. On this map, colors denote specific gene functions. For example, red genes regulate cellular processes, yellow genes regulate replication, and green genes regulate the production of the cell wall.

17.6 DNA Sequences that Move Transposable elements are DNA sequences that move from one location to another. (A) In one method of transposition, the DNA sequence is replicated and the copy inserts elsewhere in the genome. (B) Transposons contain transposable elements and other genes.

(A) If a transposable element is copied and inserted into the middle of another gene, the original gene is transcribed into an altered mRNA.

(B) A transposon consists of two transposable elements flanking another gene or genes. The entire transposon is copied and inserted as a unit.

protein will be abnormal. So transposable elements can produce significant phenotypic effects by inactivating genes.

Transposable elements are often short sequences of 1,000–2,000 base pairs, and are found at many sites in prokaryotic genomes. The mechanisms that allow them to move vary. For example, a transposable element may be replicated, and then the copy inserted into another site in the genome. Or the element might splice out of one location and move to another location.
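The two mechanisms can be contrasted with simple string operations. This is a toy sketch: the "genome" and the four-letter "element" (written in lowercase so it stands out) are invented, real elements are thousands of base pairs long, and the target position is assumed to lie outside the element itself.

```python
def copy_and_paste(genome, element, target):
    """Replicative transposition: the original stays, a copy is inserted."""
    assert element in genome
    return genome[:target] + element + genome[target:]

def cut_and_paste(genome, element, target):
    """The element is spliced out of its old site and inserted at target
    (target is a position in the original genome, outside the element)."""
    source = genome.index(element)
    remainder = genome[:source] + genome[source + len(element):]
    if target > source:
        target -= len(element)  # positions after the excision shift left
    return remainder[:target] + element + remainder[target:]

# Two toy "genes" (ATG...stop) with a made-up element between them.
genome = "ATGCCCTAA" + "ggtt" + "ATGAAATAG"
print(copy_and_paste(genome, "ggtt", 3))  # ATGggttCCCTAAggttATGAAATAG
print(cut_and_paste(genome, "ggtt", 3))   # ATGggttCCCTAAATGAAATAG
```

In both outputs the first toy gene now carries the element right after its start codon, so any mRNA made from it would contain the extra sequence, as described above. Only the copy-and-paste version increases the number of elements in the genome.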

Longer transposable elements (up to 5,000 bp) carry additional genes and are called transposons (Figure 17.6B). Sometimes these DNA regions contain a gene for antibiotic resistance.

The sequencing of prokaryotic and viral genomes has many potential benefits

Prokaryotic genome sequencing promises to provide insights into microorganisms that cause human diseases. Genome sequencing has revealed unknown genes and proteins that can be targeted for isolation and functional study. Such studies are revealing new methods to combat pathogens and their infections. Sequencing has also revealed surprising relationships between some pathogenic organisms, suggesting that genes may be transferred between different strains.

Chlamydia trachomatis causes the most common sexually transmitted disease in the United States. Because it is an intracellular parasite, it has been very hard to study. Among its 900 genes are several for ATP synthesis, something scientists used to think this bacterium could not accomplish on its own.

Rickettsia prowazekii causes typhus; it is carried by lice and infects people bitten by the lice. Of its 634 genes, 6 encode proteins that are essential for virulence. These virulence proteins are being used to develop vaccines.

Mycobacterium tuberculosis causes tuberculosis. It has a relatively large genome, coding for 4,000 proteins. Over 250 of these are used to metabolize lipids, so this may be the main way that this bacterium gets its energy. Some of its genes code for previously unidentified cell surface proteins; these proteins are targets for potential vaccines.

Streptomyces coelicolor and its close relatives are the source for the genes for two-thirds of all naturally occurring antibiotics currently in clinical use. These antibiotics include streptomycin, tetracycline, and erythromycin. The genome sequence of S. coelicolor reveals 22 clusters of genes responsible for antibiotic production, of which only four were previously known. This finding may lead to new antibiotics to combat pathogens that have evolved resistance to conventional antibiotics.

E. coli strain O157:H7 causes illness (sometimes severe) in at least 70,000 people a year in the United States. Its genome has 5,416 genes, of which 1,387 are different from those in the familiar (and harmless) laboratory strains of this bacterium. Many of these unique genes are also present in other pathogenic bacteria, such as Salmonella and Shigella. This finding suggests that there is extensive genetic exchange among these species, and that "superbugs" that share genes for antibiotic resistance may be on the horizon.

Severe acute respiratory syndrome (SARS) was first detected in southern China in 2002 and rapidly spread in 2003. There is no effective treatment and 10 percent of infected people die. Isolation of the causative agent, a virus, and the rapid sequencing of its genome revealed several proteins that are possible targets for antiviral drugs or vaccines. Research is underway on both fronts, since another outbreak is anticipated.

Genome sequencing also provides insights into organisms involved in global ecological cycles (see Chapter 58). In addition to the well-known carbon dioxide, another important gas contributing to the atmospheric greenhouse effect and global warming is methane (CH4; see Figure 2.7). Some bacteria, such as Methanococcus, produce methane in the stomachs of cows. Others, such as Methylococcus, remove methane from the air and use it as an energy source. The genomes of both of these bacteria have been sequenced. Understanding the genes involved in methane production and oxidation may help us to slow the progress of global warming.

Metagenomics allows us to describe new organisms and ecosystems

If you take a microbiology laboratory course you will learn how to identify various prokaryotes on the basis of their growth in lab cultures. For example, staphylococci are a group of bacteria that infect skin and nasal passages. When grown on a medium called blood agar they form round, raised colonies. Microorganisms can also be identified by their nutritional requirements or the conditions under which they will grow (for example, aerobic versus anaerobic). Such culture methods have been the mainstay of microbial identification for over a century and are still useful and important. However, scientists can now use PCR and modern DNA analysis techniques to analyze microbes without culturing them in the laboratory.

In 1985, Norman Pace, then at Indiana University, came up with the idea of isolating DNA directly from environmental samples. He used PCR to amplify specific sequences from the samples to determine whether particular microbes were present. The PCR products were sequenced to explore their diversity. The term metagenomics was coined to describe this approach of analyzing genes without isolating the intact organism. It is now possible to perform shotgun sequencing with samples from almost any environment. The sequences can be used to detect the presence of known microbes and pathogens, and perhaps even the presence of heretofore unidentified organisms (Figure 17.7). For example:

Shotgun sequencing of DNA from 200 liters of seawater indicated that it contained 5,000 different viruses and [...] different bacteria, many of which had not been described previously.

One kilogram of marine sediment contained a million different viruses, most of them new.


Water runoff from a mine contained many new species of prokaryotes thriving in this apparently inhospitable environment. Some of these organisms exhibited metabolic pathways that were previously unknown to biologists. These organisms and their capabilities may be useful in cleaning up pollutants from the water.
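The idea of using environmental sequences to detect known microbes, and to flag possible new ones, can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the reference sequences, the reads, the names `known_microbe_A` and `known_microbe_B`, and the exact k-mer matching rule are all invented here; real metagenomic pipelines use far more sophisticated alignment and statistics.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_reads(reads, references, k=8, min_shared=1):
    """Label each read with the reference sharing the most k-mers,
    or "unidentified" if no reference shares at least min_shared k-mers."""
    ref_index = {name: kmers(seq, k) for name, seq in references.items()}
    results = {}
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_shared = None, 0
        for name, ref_kmers in ref_index.items():
            shared = len(read_kmers & ref_kmers)
            if shared > best_shared:
                best_name, best_shared = name, shared
        results[read] = best_name if best_shared >= min_shared else "unidentified"
    return results


# Hypothetical reference genomes (invented sequences, not real organisms).
references = {
    "known_microbe_A": "ATGGCGTACGTTAGCCTAGGCTAACGT",
    "known_microbe_B": "TTGACCGGTATCCGATTACGGATCCAA",
}

# Hypothetical shotgun reads from an environmental sample.
reads = ["GCGTACGTTAGCC", "CGGTATCCGATTA", "GGGGGGCCCCCCAAAA"]

labels = classify_reads(reads, references)
for read, label in labels.items():
    print(read, "->", label)
```

Reads that match nothing in the reference set are the interesting ones here: they are the computational analog of the "heretofore unidentified" organisms mentioned above.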

These and other discoveries are truly extraordinary and potentially very important. It is estimated that 90 percent of the microbial world has been invisible to biologists and is only now being revealed by metagenomics. Entirely new ecosystems of bacteria and viruses are being discovered in which, for example, one species produces a molecule that another metabolizes. It is hard to overemphasize the importance of such an increase in our knowledge of the hidden world of microbes. This knowledge will help us to understand natural ecological processes, and has the potential to help us find better ways to manage environmental catastrophes such as oil spills, or remove toxic heavy metals from soil.

Will defining the genes required for cellular life lead to artificial life?

When the genomes of prokaryotes and eukaryotes are compared, a striking conclusion arises: certain genes are present in all organisms (universal genes). There are also some (nearly) universal gene segments that are present in many genes in many organisms; for example, the sequence that codes for an ATP binding site. These findings suggest that there is some ancient, minimal set of DNA sequences common to all cells. One way to identify these sequences is to look for them in computer analyses of sequenced genomes.

Another way to define the minimal genome is to take an organism with a simple genome and deliberately mutate one gene at a time to see what happens. M. genitalium has one of the smallest known genomes: only 482 protein-coding genes. Even so, some of its genes are dispensable under some circumstances. For example, it has genes for metabolizing both glucose and fructose, but it can survive in the laboratory on a medium containing only one of these sugars.

What about other genes? Researchers have addressed this question with experiments involving the use of transposons as mutagens. When transposons in the bacterium are activated, they insert themselves into genes at random, mutating and inactivating them (Figure 17.8). The mutated bacteria are tested for growth and survival, and DNA from interesting mutants is sequenced to find out which genes contain transposons.

The astonishing result of these studies is that M. genitalium can survive in the laboratory with a minimal genome of only 382 functional genes! Is this really all it takes to make a viable organism? Experiments are underway to make a synthetic genome based on that of M. genitalium, and then insert it into an empty bacterial cell. If the cell starts transcribing mRNA and making proteins (that is, if it is in fact viable) it may turn out to be the first life created by humans.
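The logic of the transposon experiments described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the researchers' actual procedure: the four gene names, the viability rule (growth on a glucose-only medium), and the function names are all hypothetical and chosen only to mirror the glucose/fructose example in the text.

```python
import random

# Hypothetical toy genome; names and viability rule are invented for illustration.
GENES = ["ribosomal_protein", "dna_polymerase",
         "glucose_metabolism", "fructose_metabolism"]

# Assumption: on a glucose-only medium the cell needs these three genes.
REQUIRED_ON_GLUCOSE_MEDIUM = {"ribosomal_protein", "dna_polymerase",
                              "glucose_metabolism"}


def is_viable(active_genes):
    """A mutant grows only if every required gene is still active."""
    return REQUIRED_ON_GLUCOSE_MEDIUM <= active_genes


def find_essential_genes(genes, viable, n_insertions=2000, seed=0):
    """Simulate random single-gene transposon insertions.

    Any gene whose inactivation still allows growth is dispensable.
    Genes never seen in a viable mutant are reported as essential
    (this includes genes that were never hit, so many insertions are needed).
    """
    rng = random.Random(seed)
    dispensable = set()
    for _ in range(n_insertions):
        hit = rng.choice(genes)            # transposon lands in a random gene
        if viable(set(genes) - {hit}):     # test the mutant for growth
            dispensable.add(hit)
    return set(genes) - dispensable


essential = find_essential_genes(GENES, is_viable)
print(sorted(essential))
```

In this toy case the fructose gene is found to be dispensable on the glucose medium, which parallels the way 482 genes were narrowed to a smaller essential set.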

In addition to the technical feat of creating artificial life, this technique could have important applications. New microbes could be made with entirely new abilities, such as degrading oil spills, making synthetic fibers, reducing tooth decay, or converting cellulose to ethanol for use as fuel. On the other hand, fears of the misuse or mishandling of this knowledge are not unfounded. For example, it might also be possible to develop synthetic bacteria harmful to people, animals or plants, and use them as agents of biological warfare or bioterrorism. The genomics genie is, for better or worse, already out of the bottle. Hopefully human societies will use it to their benefit.

17.2 RECAP

DNA sequencing is used to study the genomes of prokaryotes that are important to humans and to ecosystems. Functional genomics uses gene sequences to determine the functions of the gene products. Comparative genomics compares gene sequences from different organisms to help identify their functions and evolutionary relationships. Transposable elements and transposons move from one place to another in the genome.

Give some examples of prokaryotic genomes that have been sequenced. What have the sequences shown? See pp. 371-373

What is metagenomics and how is it used? See pp. 373-374 and Figure 17.7

How are selective inactivation studies being used to determine the minimal genome? See p. 374 and Figure 17.8




17.7 Metagenomics Microbial DNA extracted from the environment can be amplified and analyzed. This has led to the description of many new genes and species.

[Figure steps: (1) DNA is isolated from the environment. (2) DNA is fragmented and inserted into a cloning vector. (3) Clones are amplified and inserted into E. coli to make a library. Labels: metagenomic DNA fragment; vector; DNA and protein analysis.]



Advances in DNA sequencing and analysis have led to the rapid sequencing of eukaryotic genomes. We now turn to the results of these analyses.

17.3 What Have We Learned from Sequencing Eukaryotic Genomes?

As genomes have been sequenced and described, a number of major differences have emerged between eukaryotic and prokaryotic genomes (Table 17.1). Key differences include:

Eukaryotic genomes are larger than those of prokaryotes, and they have more protein-coding genes. This difference is not surprising, given that multicellular organisms have many cell types with specific functions. Many proteins are needed to do those specialized jobs. A typical virus contains enough DNA to code for only a few proteins: about 10,000 base pairs. As we saw above, the simplest prokaryote, Mycoplasma, has several hundred protein-coding genes in a genome of 0.5 million base pairs. A rice plant, in contrast, has 37,544 genes.

Eukaryotic genomes have more regulatory sequences (and many more regulatory proteins) than prokaryotic genomes. The greater complexity of eukaryotes requires much more regulation, which is evident in the many points of control associated with the expression of eukaryotic genes (see Figure 16.13).

Much of eukaryotic DNA is noncoding. Distributed throughout many eukaryotic genomes are various kinds of DNA sequences that are not transcribed into mRNA, most notably introns and gene control sequences. As we discuss in Chapter 16, some noncoding sequences are transcribed into microRNAs. In addition, eukaryotic genomes contain various kinds of repeated sequences. These features are rare in prokaryotes.

Eukaryotes have multiple chromosomes. The genomic "encyclopedia" of a eukaryote is separated into multiple "volumes." Each chromosome must have, at a minimum, three defining DNA sequences that we have described in previous chapters: an origin of replication (ori) that is recognized by the DNA replication machinery; a centromere region that holds the replicated chromosomes together before mitosis; and a telomeric sequence at each end of the chromosome that maintains chromosome integrity.

Model organisms reveal many characteristics of eukaryotic genomes

Most of the lessons learned from eukaryotic genomes have come from several simple model organisms that have been studied extensively: the yeast Saccharomyces cerevisiae, the nematode (roundworm) Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and (representing plants) the thale cress, Arabidopsis thaliana. Model organisms have been chosen because they are relatively easy to grow and study in a laboratory, their genetics are well studied, and they exhibit characteristics that represent a larger group of organisms.

YEAST: THE BASIC EUKARYOTIC MODEL Yeasts are single-celled eukaryotes. Like most eukaryotes, they have membrane-enclosed organelles, such as the nucleus and endoplasmic reticulum, and a life cycle that alternates between haploid and diploid generations (see Figure 11.15).




INVESTIGATING LIFE

17.8 Using Transposon Mutagenesis to Determine the Minimal Genome

Mycoplasma genitalium has the smallest number of genes of any prokaryote. But are all of its genes essential to life? By inactivating the genes one by one, scientists determined which of them are essential for the cell's survival. This research may lead to the construction of artificial cells with customized genomes, designed to perform functions such as degrading oil and making plastics.

HYPOTHESIS Only some of the genes in a bacterial genome are essential for cell survival.

METHOD M. genitalium has 482 genes; only two (A and B) are shown here. A transposon inserts randomly into one gene, inactivating it (Experiment 1: inactive gene A; Experiment 2: inactive gene B). Each mutant is put into growth medium.

RESULTS Growth means that gene A is not essential. No growth means that gene B is essential.

CONCLUSION If each gene is inactivated in turn, a "minimal essential genome" can be determined.

Go to yourBioPortal.com for original citations, discussions, and relevant links for all INVESTIGATING LIFE figures.



While the prokaryote E. coli has a single circular chromosome with about 4.6 million bp and 4,290 protein-coding genes, budding yeast (Saccharomyces cerevisiae) has 16 linear chromosomes and a haploid content of more than 12.5 million bp, with 5,770 protein-coding genes. Gene inactivation studies similar to those carried out for M. genitalium (see Figure 17.8) indicate that fewer than 20 percent of these genes are essential to survival.

The most striking difference between the yeast genome and that of E. coli is in the number of genes for targeting proteins to organelles (Table 17.2). Both of these single-celled organisms appear to use about the same numbers of genes to perform the basic functions of cell survival. It is the compartmentalization of the eukaryotic yeast cell into organelles that requires it to have many more genes. This finding is direct, quantitative confirmation of something we have known for a century: the eukaryotic cell is structurally more complex than the prokaryotic cell.

THE NEMATODE: UNDERSTANDING EUKARYOTIC DEVELOPMENT In 1965 Sydney Brenner, fresh from being part of the team that first isolated mRNA, looked for a simple organism in which to study multicellularity. He settled on Caenorhabditis elegans, a millimeter-long nematode (roundworm) that normally lives in the soil. It can also live in the laboratory, where it has become a favorite model organism of developmental biologists (see Section 19.4). The nematode has a transparent body that develops over 3 days from a fertilized egg to an adult worm made up of nearly 1,000 cells. In spite of its small number of cells, the nematode has a nervous system, digests food, reproduces sexually, and ages. So it is not surprising that an intense effort was made to sequence the genome of this model organism.

The C. elegans genome (100 million bp) is eight times larger than that of yeast and has 3.5 times as many protein-coding genes (19,427). Gene inactivation studies have shown that the worm can survive in laboratory cultures with only 10 percent of these genes. So the minimum genome of a worm is about twice the size of that of yeast, which in turn is four times the size of the minimum genome for Mycoplasma. What do these extra genes do?

All cells must have genes for survival, growth, and division. In addition, the cells of multicellular organisms must have genes for holding cells together to form tissues, for cell differentiation, and for intercellular communication. Looking at Table 17.3, you will recognize functions that we discussed in earlier chapters,


TABLE 17.1

Representative Sequenced Genomes

ORGANISM           HAPLOID GENOME SIZE (Mb)   NUMBER OF GENES   PROTEIN-CODING SEQUENCE

Bacteria
M. genitalium              0.58                      485                88%
H. influenzae              1.8                     1,738                89%
E. coli                    4.6                     4,377                88%

Yeasts
S. cerevisiae             12.5                     5,770                70%
S. pombe                  12.5                     4,929                60%

Plants
A. thaliana              115                      28,000                25%
Rice                     390                      37,544                12%

Animals
C. elegans               100                      19,427                25%
D. melanogaster          123                      13,379                13%
Pufferfish               342                      27,918                10%
Chicken                1,130                      25,000                 3%
Human                  3,300                      24,000                 1.2%

Mb = millions of base pairs
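One way to read Table 17.1 is to compute gene density (genes per million base pairs) for each organism. The short sketch below does this with the table's own values; the variable names are ours, and the calculation is only a rough summary, since it ignores gene length and noncoding DNA within genes.

```python
# (haploid genome size in Mb, number of genes), copied from Table 17.1
genomes = {
    "M. genitalium": (0.58, 485),
    "H. influenzae": (1.8, 1738),
    "E. coli": (4.6, 4377),
    "S. cerevisiae": (12.5, 5770),
    "S. pombe": (12.5, 4929),
    "A. thaliana": (115, 28000),
    "Rice": (390, 37544),
    "C. elegans": (100, 19427),
    "D. melanogaster": (123, 13379),
    "Pufferfish": (342, 27918),
    "Chicken": (1130, 25000),
    "Human": (3300, 24000),
}

# Genes per Mb for each organism
density = {name: genes / size_mb for name, (size_mb, genes) in genomes.items()}

for name, d in sorted(density.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:16s} {d:8.1f} genes per Mb")

densest = max(density, key=density.get)
sparsest = min(density, key=density.get)
```

The bacteria carry several hundred genes per Mb, whereas the human genome carries fewer than ten, which fits the falling "protein-coding sequence" percentages in the last column of the table.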

TABLE 17.2

Comparison of the Genomes of E. coli and Yeast

                                          E. COLI        YEAST

Genome length (base pairs)              4,640,000   12,068,000
Number of protein-coding genes              4,290        5,770

Proteins with roles in:
Metabolism                                    650          650
Energy production/storage                     240          175
Membrane transport                            280          250
DNA replication/repair/recombination          120          175
Transcription                                 230          400
Translation                                   180          350
Protein targeting/secretion                    35          430
Cell structure                                180          250
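The claim that protein targeting is the most striking difference between the two genomes can be checked directly from Table 17.2. The following sketch simply divides the yeast count by the E. coli count for each functional category; the dictionary names are ours.

```python
# Numbers of proteins per functional category, copied from Table 17.2
e_coli = {
    "Metabolism": 650,
    "Energy production/storage": 240,
    "Membrane transport": 280,
    "DNA replication/repair/recombination": 120,
    "Transcription": 230,
    "Translation": 180,
    "Protein targeting/secretion": 35,
    "Cell structure": 180,
}
yeast = {
    "Metabolism": 650,
    "Energy production/storage": 175,
    "Membrane transport": 250,
    "DNA replication/repair/recombination": 175,
    "Transcription": 400,
    "Translation": 350,
    "Protein targeting/secretion": 430,
    "Cell structure": 250,
}

# Ratio of yeast to E. coli protein counts in each category
ratio = {category: yeast[category] / e_coli[category] for category in e_coli}

for category, r in sorted(ratio.items(), key=lambda item: item[1], reverse=True):
    print(f"{category:40s} {r:5.2f}x")

largest_difference = max(ratio, key=ratio.get)
```

Most categories differ by less than a factor of two, while protein targeting and secretion differs by more than a factor of twelve.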

TABLE 17.3

C. elegans Genes Essential to Multicellularity

FUNCTION                      PROTEIN/DOMAIN                                   NUMBER OF GENES

Transcription control         Zinc finger; homeobox                                   540
RNA processing                RNA binding domains                                     100
Nerve impulse transmission    Gated ion channels                                       80
Tissue formation              Collagens                                               170
Cell interactions             Extracellular domains; glycotransferases                330
Cell-cell signaling           G protein-linked receptors; protein kinases;          1,290
                              protein phosphatases





including gene regulation (see Chapter 16) and cell communication (see Chapter 7).

DROSOPHILA MELANOGASTER: RELATING GENETICS TO GENOMICS

The fruit fly Drosophila melanogaster is a famous model organism. Studies of fruit fly genetics resulted in the formulation of many basic principles of genetics (see Section 12.4). Over 2,500 mutations of D. melanogaster had been described by the 1990s when genome sequencing began, and this fact alone was a good reason for sequencing the fruit fly's DNA. The fruit fly is a much larger organism than C. elegans, both in size (it has 10 times more cells) and complexity, and it undergoes complicated developmental transformations from egg to larva to pupa to adult.

Not surprisingly, the fly's genome (about 123 million bp) is larger than that of C. elegans. But as we mentioned earlier, genome size does not necessarily correlate with the number of genes encoded. In this case, the larger fruit fly genome contains fewer genes (13,379) than the smaller nematode genome. Figure 17.9 summarizes the functions of the Drosophila genes that have been characterized so far; this distribution is typical of complex eukaryotes.

ARABIDOPSIS: STUDYING THE GENOMES OF PLANTS About 250,000 species of flowering plants dominate the land and fresh water. But in the context of the history of life, the flowering plants are fairly young, having evolved only about 200 million years ago. The genomes of some plants are huge: for example, the genome of corn is about 3 billion bp, and that of wheat is 16 billion bp. So although we are naturally most interested in the genomes of plants we use as food and fiber, it is not surprising that scientists first chose to sequence a simpler flowering plant.

Arabidopsis thaliana, thale cress, is a member of the mustard family and has long been a favorite model organism of plant biologists. It is small (hundreds could grow and reproduce in the space occupied by this page) and easy to manipulate, and has a relatively small (115 million bp) genome.

The Arabidopsis genome has about 28,000 protein-coding genes but, remarkably, many of these genes are duplicates and probably originated by chromosomal rearrangements. When these duplicate genes are subtracted from the total, about [...] unique genes are left, similar to the gene numbers found in fruit flies and nematodes. Indeed, many of the genes found in these animals have homologs (genes with very similar sequences) in Arabidopsis and other plants, suggesting that plants and animals have a common ancestor.

But Arabidopsis has some genes that distinguish it as a plant (Table 17.4). These include genes involved in photosynthesis, in the transport of water into the root and throughout the plant, in the assembly of the cell wall, in the uptake and metabolism of inorganic substances from the environment, and in the synthesis of specific molecules used for defense against microbes and herbivores (organisms that eat plants). These plant defense molecules may be a major reason why the number of protein-coding genes in plants is higher than in animals. Plants cannot escape their enemies or other adverse conditions as animals can, and so they must cope with the situation where they are. So they make tens of thousands of molecules to fight their enemies and adapt to the environment (see Chapter 39).

These plant-specific genes are also found in the genomes of other plants, including rice, the first major crop plant whose sequence has been determined. Rice (Oryza sativa) is the world's most important crop; it is a staple in the diet of 3 billion people. The larger genome in rice has a set of genes remarkably similar to that of Arabidopsis. More recently the genome of the poplar tree, Populus trichocarpa, was sequenced to gain insight into the potential for this rapidly growing tree to be used as a source of fixed carbon for making fuel. A comparison of the three genomes shows many genes in common, comprising the minimal plant genome (Figure 17.10).

Eukaryotes have gene families

About half of all eukaryotic protein-coding genes exist as only one copy in the haploid genome (two copies in somatic cells). The rest are present in multiple copies, which arose from gene duplications. Over evolutionary time, different copies of genes have undergone separate mutations, giving rise to groups of closely related genes called gene families. Some gene families, such as those encoding the globin proteins that make up hemoglobin, contain only a few members; other families, such as the genes encoding the immunoglobulins that make up antibodies, have hundreds of members. In the human genome,




17.9 Functions of the Eukaryotic Genome The distribution of gene functions in Drosophila melanogaster shows a pattern that is typical of many complex organisms.

[Pie chart categories: cell signaling; cell cycle; cell structure and organelles; membrane structure and transport; enzymes for general metabolism; DNA replication, maintenance, and expression. Segment sizes shown: 25%, 20%, 20%, 15%, 10%, 10%; the pairing of sizes with categories is not recoverable from the text.]

TABLE 17.4

Arabidopsis Genes Unique to Plants

FUNCTION                    NUMBER OF GENES

Cell wall and growth               42
Water channels                    300
Photosynthesis                    139
Defense and metabolism             94



there are 24,000 protein-coding genes, but 16,000 distinct gene families. So only one-third of the human genes are unique.

The DNA sequences in a gene family are usually different from one another. As long as at least one member encodes a functional protein, the other members may mutate in ways that change the functions of the proteins they encode. For evolution, the availability of multiple copies of a gene allows for selection of mutations that provide advantages under certain circumstances. If a mutated gene is useful, it may be selected for in succeeding generations. If the mutated gene is a total loss, the functional copy is still there to carry out its role.

The gene family encoding the globins is a good example of the gene families found in vertebrates. These proteins are found in hemoglobin and myoglobin (an oxygen-binding protein present in muscle). The globin genes all arose long ago from a single common ancestral gene. In humans, there are three functional members of the α-globin cluster and five in the β-globin cluster (Figure 17.11). In adults, each hemoglobin molecule is a tetramer containing two identical α-globin subunits, two identical β-globin subunits, and four heme pigments (see Figure 3.10).

During human development, different members of the globin gene cluster are expressed at different times and in different tissues. This differential gene expression has great physiological significance. For example, hemoglobin containing γ-globin, a subunit found in the hemoglobin of the human fetus, binds O2 more tightly than adult hemoglobin does. This specialized form of hemoglobin ensures that in the placenta, O2 will be transferred from the mother's blood to the developing fetus's blood. Just before birth the liver stops synthesizing fetal hemoglobin and the bone marrow cells take over, making the adult forms (2α and 2β). Thus hemoglobins with different binding affinities for O2 are provided at different stages of human development.

In addition to genes that encode proteins, many gene families include nonfunctional pseudogenes, which are designated with the Greek letter psi (ψ) (see Figure 17.11). These pseudogenes result from mutations that cause a loss of function rather than an enhanced or new function. The DNA sequence of a pseudogene may not differ greatly from that of other family members. It may simply lack a promoter, for example, and thus fail to be transcribed. Or it may lack a recognition site needed for the removal of an intron, so that the transcript it makes is not correctly processed into a useful mature mRNA. In some gene families pseudogenes outnumber functional genes. Because some members of the family are functional, there appears to be little selection pressure for the deletion of pseudogenes.

Eukaryotic genomes contain many repetitive sequences

Eukaryotic genomes contain numerous repetitive DNA sequences that do not code for polypeptides. These include highly repetitive sequences, moderately repetitive sequences, and transposons.

Highly repetitive sequences are short (less than 100 bp) sequences that are repeated thousands of times in tandem (side-by-side) arrangements in the genome. They are not transcribed. Their proportion in eukaryotic genomes varies, from 10 percent in humans to about half the genome in some species of fruit flies. Often they are associated with heterochromatin, the densely packed, transcriptionally inactive part of the genome. Other highly repetitive sequences are scattered around the genome. For example, short tandem repeats (STRs) of 1-5 bp can be repeated up to 100 times at a particular chromosomal location. The copy number of an STR at a particular location varies between individuals and is inherited. In Chapter 15 we describe how STRs can be used in the identification of individuals (DNA fingerprinting).

Moderately repetitive sequences are repeated 10-1,000 times in the eukaryotic genome. These sequences include the genes that are transcribed to produce tRNAs and rRNAs, which are used in protein synthesis. The cell makes tRNAs and rRNAs constantly, but even at the maximum rate of transcription, single copies of the tRNA and rRNA genes would be inadequate to supply the large amounts of these molecules needed by most cells. Thus the genome has multiple copies of these genes.
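The short tandem repeats (STRs) described above lend themselves to a simple computational illustration: counting how many times a motif is repeated back-to-back. The sketch below is a minimal example, not a forensic method; the motif `GATA`, the flanking sequences, and the two "individuals" are invented for illustration.

```python
def max_tandem_repeats(sequence, motif):
    """Return the largest number of consecutive (tandem) copies of motif
    found anywhere in sequence."""
    best = 0
    m = len(motif)
    for start in range(len(sequence) - m + 1):
        count = 0
        pos = start
        # Count back-to-back copies of the motif beginning at this position.
        while sequence[pos:pos + m] == motif:
            count += 1
            pos += m
        best = max(best, count)
    return best


# Hypothetical alleles at one STR locus: same flanking DNA, different copy number.
individual_1 = "TTCG" + "GATA" * 7 + "CCTA"
individual_2 = "TTCG" + "GATA" * 11 + "CCTA"

copies_1 = max_tandem_repeats(individual_1, "GATA")
copies_2 = max_tandem_repeats(individual_2, "GATA")
print(copies_1, copies_2)  # prints: 7 11
```

Because the copy number differs between the two sequences while the flanking DNA is the same, the locus distinguishes the two individuals, which is the basis of the identification use mentioned in the text.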

In mammals, four different rRNA molecules make up the ribosome: the 18S, 5.8S, 28S, and 5S rRNAs. (The S stands for Svedberg unit, which is a measure of size.) The 18S, 5.8S, and




17.10 Plant Genomes Three plant genomes share a common set of approximately 21,000 genes that appear to comprise the "minimal" plant genome.

[Venn diagram of Arabidopsis thaliana, Populus trichocarpa, and Oryza sativa. Regions are labeled as genes unique to Arabidopsis, genes unique to poplar, genes unique to rice, and genes shared by all three plant genomes (21,000). Other values shown: 4,000; 1,600; 75; 600; 9,000; 18,000; their assignment to regions is not recoverable from the text.]

17.11 The Globin Gene Family The α-globin and β-globin clusters of the human globin gene family are located on different chromosomes. The genes of each cluster are separated by noncoding "spacer" DNA. The nonfunctional pseudogenes are indicated by the Greek letter psi (ψ). The γ gene has two variants, Aγ and Gγ.



28S rRNAs are transcribed together as a single precursor RNA molecule (Figure 17.12). As a result of several posttranscriptional steps, the precursor is cut into the final three rRNA products, and the noncoding "spacer" RNA is discarded. The sequence encoding these RNAs is moderately repetitive in humans: a total of 280 copies of the sequence are located in clusters on five different chromosomes.

TRANSPOSONS Apart from the RNA genes, most moderately repetitive sequences are not stably integrated into the genome. Instead, these sequences can move from place to place, and are thus called transposable elements or transposons. Prokaryotes also have transposons (see Figure 17.6). Transposons make up over 40 percent of the human genome and about 50 percent of the maize genome, although the percentage is smaller (3-10 percent) in many other eukaryotes.

There are four main types of transposons in eukaryotes:

1. SINEs (short interspersed elements) are up to 500 bp long and are transcribed but not translated. There are about 1.5 million of them scattered over the human genome, making up about 15 percent of the total DNA content. A single type, the 300-bp Alu element, accounts for 11 percent of the human genome; it is present in a million copies.

2. LINEs (long interspersed elements) are up to 7,000 bp long, and some are transcribed and translated into proteins. They constitute about 17 percent of the human genome.

SINEs and LINEs move about the genome in a distinctive way: they are transcribed into RNA, which then acts as a template for new DNA. The new DNA becomes inserted at a new location in the genome. This "copy and paste" mechanism results in two copies of the transposon: one at the original location and the other at a new location.

3. Retrotransposons also make RNA copies of themselves as they move about the genome. Some of them encode proteins needed for their own transposition, and others do not. SINEs and LINEs are types of retrotransposons. Non-SINE, non-LINE retrotransposons constitute about 8 percent of the human genome.

4. DNA transposons do not use RNA intermediates. Like prokaryotic transposable elements, they are excised from the original location and become inserted at a new location without being replicated.
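The percentages quoted in the list above can be checked against the earlier statement that transposons make up over 40 percent of the human genome. The sketch below is simple bookkeeping with the text's own rounded figures; the 3.3 billion bp genome size is the haploid human value used elsewhere in this chapter, and the variable names are ours.

```python
# Approximate shares of the human genome, in percent, as quoted in the text
sines_percent = 15
lines_percent = 17
other_retrotransposons_percent = 8

# All three classes move through an RNA intermediate
retro_total_percent = (sines_percent + lines_percent
                       + other_retrotransposons_percent)
print(retro_total_percent)  # prints: 40

# Rough estimate of the Alu share from copy number and element length
HAPLOID_GENOME_BP = 3_300_000_000   # haploid human genome size used in this chapter
alu_copies = 1_000_000
alu_length_bp = 300
alu_percent = 100 * alu_copies * alu_length_bp / HAPLOID_GENOME_BP
print(round(alu_percent, 1))  # prints: 9.1
```

The three RNA-based classes alone sum to 40 percent, so adding DNA transposons (for which no percentage is given) is consistent with "over 40 percent." The Alu estimate comes out somewhat below the quoted 11 percent; since every input is rounded, only rough agreement should be expected.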

What role do these moving sequences play in the cell? The best answer so far seems to be that transposons are simply cellular parasites that can be replicated. The insertion of a transposon at a new location can have important consequences. For example, the insertion of a transposon into the coding region of a gene results in a mutation (see Figure 17.8). This phenomenon accounts for a few rare forms of several genetic diseases in humans, including hemophilia and muscular dystrophy. If the insertion of a transposon takes place in the germ line, a gamete with a new mutation results. If the insertion takes place in a somatic cell, cancer may result.

Sometimes an adjacent gene can be replicated along with a transposon, resulting in a gene duplication. A transposon can carry a gene, or a part of it, to a new location in the genome, shuffling the genetic material and creating new genes. Clearly, transposition stirs the genetic pot in the eukaryotic genome and thus contributes to genetic variation.

Section 5.5 describes the theory of endosymbiosis, which proposes that chloroplasts and mitochondria are the descendants of once free-living prokaryotes. Transposons may have played a role in endosymbiosis. In living eukaryotes, chloroplasts and mitochondria contain some DNA, but the nucleus contains most of the genes




17.12 A Moderately Repetitive Sequence Codes for rRNA (A) This rRNA gene, along with its nontranscribed spacer region, is repeated 280 times in the human genome, in clusters on five chromosomes. (B) This electron micrograph shows transcription of multiple rRNA genes.

[Figure labels: (A) DNA with a 13,000 bp transcribed region (18S, 5.8S, 28S) and a 30,000 bp nontranscribed spacer region; the pre-rRNA transcript; processing steps remove the spacers within the transcribed region to yield the 18S, 5.8S, and 28S rRNAs. (B) Transcription begins here, the RNA elongates and elongates until it is released here; many rRNA precursors are being transcribed from multiple rRNA genes; strands of rRNA; DNA.]



that encode the organelles' proteins. If the organelles were once independent, they must originally have contained all of those genes. How did the genes move to the nucleus? They may have done so by DNA transpositions between organelles and the nucleus, which still occur today. The DNA that remains in the organelles may be the remnants of more complete prokaryotic genomes.

See Figure 17.13 for a summary of the various types of sequences in the human genome.

17.3 RECAP

The sequencing of the genomes of model organisms demonstrated common features of the eukaryotic genome, including the presence of repetitive sequences and transposons. Some eukaryotic genes are in families, which may include members that are mutated and nonfunctional. Some sequences are transcribed, but others are not.

What are the major differences between prokaryotic and eukaryotic genomes? See p. 375

Describe one function of genes found in C. elegans that has no counterpart in the genome of yeast. See p. 376 and Table 17.3

What is the evolutionary role of eukaryotic gene families? See p. 377

Why are there multiple copies of sequences coding for rRNA in the mammalian genome? See p. 378

What effects can transposons have on a genome? See p. 379

The analysis of eukaryotic genomes has resulted in an enormous amount of useful information, as we have seen. In the next section we look more closely at the human genome.

17.4 What Are the Characteristics of the Human Genome?

By the start of 2005 the first human genome sequences were completed, two years ahead of schedule and well under budget. The published sequences, one produced by the publicly funded Human Genome Project, and the other by a private company, were haploid genomes that were composites of several people. Since 2005, the diploid genomes of several individuals have been sequenced and published.

The human genome sequence held some surprises

The following are just some of the interesting facts that we have learned about the human genome:

Of the 3.3 billion base pairs in the haploid human genome, fewer than 2 percent (about 24,000 genes) make up protein-coding regions. This was a surprise. Before sequencing began, humans were estimated, based on the diversity of the proteins, to have 80,000-150,000 genes. The actual number of genes (not many more than in a fruit fly) means that posttranscriptional mechanisms (such as alternative splicing) must account for the observed number of proteins in humans. That is, the average human gene must code for several different proteins.

The average gene has 27,000 base pairs. Gene sizes vary greatly, from about 1,000 to 2.4 million base pairs. Variation in gene size is to be expected given that human proteins (and RNAs) vary in size, from 100 to about 5,000 amino acids per polypeptide chain.

Virtually all human genes have many introns.

Over 50 percent of the genome is made up of transposons and other highly repetitive sequences. Repetitive sequences near genes are GC-rich, while those farther away from genes are AT-rich.

Most of the genome (about 97 percent) is the same in all people. Despite this apparent homogeneity, there are, of course, many individual differences. Scientists have mapped over 7 million single nucleotide polymorphisms (SNPs) in humans.

Genes are not evenly distributed over the genome. Chromosome 19 is packed densely with genes, while chromosome 8 has long stretches without coding regions. The Y chromosome has the fewest genes (231), while chromosome 1 has the most (2,968).
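The figures in this list can be checked against one another with a little arithmetic. The sketch below uses only numbers quoted above (genome size, gene count, average gene length, and the 2 percent upper bound on coding sequence); the function and variable names are illustrative. It suggests that genes, introns included, span roughly a fifth of the genome even though less than 2 percent is coding, which fits the observation that virtually all human genes have many introns.

```python
# Back-of-the-envelope check using only figures quoted in the text.
GENOME_BP = 3.3e9        # haploid genome size in base pairs
GENE_COUNT = 24_000      # approximate number of protein-coding genes
AVG_GENE_BP = 27_000     # average gene length, introns included
CODING_FRACTION = 0.02   # upper bound: "fewer than 2 percent"


def genome_summary():
    """Derive a few consequences of the quoted genome statistics."""
    genic_bp = GENE_COUNT * AVG_GENE_BP            # DNA spanned by genes
    max_coding_bp = GENOME_BP * CODING_FRACTION    # upper bound on coding DNA
    max_coding_per_gene = max_coding_bp / GENE_COUNT
    return {
        "genic_fraction": genic_bp / GENOME_BP,
        "max_coding_bp_per_gene": max_coding_per_gene,
        "max_codons_per_gene": max_coding_per_gene / 3,
    }


if __name__ == "__main__":
    for name, value in genome_summary().items():
        print(f"{name}: {value:,.2f}")
```

The upper bound of roughly 900 codons per gene sits comfortably inside the 100 to 5,000 amino acid range given for human polypeptides.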

Comparisons between sequenced genomes from prokaryotes and eukaryotes have revealed some of the evolutionary relationships between genes. Some genes are present in both prokaryotes and eukaryotes; others are only in eukaryotes; still others are only in animals, or only in vertebrates (Figure 17.14).




17.13 Sequences in the Eukaryotic Genome There are many types of DNA sequences. Some are transcribed, and some of those sequences are translated. (Figure labels: single-copy genes, with promoters and expression control sequences, introns, and exons; highly repetitive short sequences; moderately repetitive sequences, comprising rRNA and tRNA genes and transposable elements such as SINEs, LINEs, retrotransposons, and DNA transposons.)



More comparative genomics is possible now that the genomes of two other primates, the chimpanzee and the rhesus macaque, have been sequenced. The chimpanzee is evolutionarily close to humans, and shares 95 percent of the human genome sequence. The more distantly related rhesus macaque shares 91 percent of the human sequence. The search is on for a set of human genes that differ from the other primates and "make humans human."

Human genomics has potential benefits in medicine

Complex phenotypes are determined not by single genes, but by multiple genes interacting with the environment. The single-allele models of phenylketonuria and sickle-cell anemia (see Chapter 15) do not apply to such common disorders as diabetes, heart disease, and Alzheimer's disease. To understand the genetic bases of these diseases, biologists are now using rapid genotyping technologies to create "haplotype maps," which are used to identify SNPs (pronounced "snips") that are linked to genes involved in disease.

HAPLOTYPE MAPPING The SNPs that differ between individuals are not inherited as independent alleles. Rather, a set of SNPs that are present on a segment of chromosome are usually inherited as a unit. This linked piece of a chromosome is called a haplotype. You can think of the haplotype as a sentence and the SNP as a word in the sentence. Analyses of haplotypes in humans from all over the world have shown that the number of common variations is limited.
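To make the idea of SNPs being "inherited as a unit" concrete, the following sketch counts haplotypes in a small, entirely made-up sample. Each string stands for the alleles at four linked SNP positions on one chromosome segment; the data and function names are hypothetical. If the four SNPs assorted independently, 16 combinations would be possible, but only a few actually occur.

```python
from collections import Counter

# Hypothetical data: alleles at four linked SNP positions, one string per
# sampled chromosome segment. These sequences are invented for illustration.
chromosome_segments = [
    "ACGT", "ACGT", "GTGA", "ACGT", "GTGA", "GTCA", "ACGT", "GTGA",
]


def haplotype_frequencies(segments):
    """Return the frequency of each observed combination of SNP alleles."""
    counts = Counter(segments)
    total = len(segments)
    return {haplotype: n / total for haplotype, n in counts.items()}


def possible_combinations(segments):
    """Number of haplotypes expected if every SNP assorted independently."""
    total = 1
    for site in range(len(segments[0])):
        alleles_at_site = {segment[site] for segment in segments}
        total *= len(alleles_at_site)
    return total


if __name__ == "__main__":
    print("observed haplotypes:", haplotype_frequencies(chromosome_segments))
    print("possible if independent:", possible_combinations(chromosome_segments))
```

Because so few combinations occur, genotyping one or two "tag" SNPs is often enough to infer the rest of the haplotype, which is what makes haplotype maps practical.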

GENOTYPING TECHNOLOGY AND PERSONAL GENOMICS New technologies are continually being developed to analyze thousands or millions of SNPs in the genomes of individuals. Such technologies include rapid sequencing methods and DNA microarrays that depend on DNA hybridization to identify SNPs. For example, a microarray of 500,000 SNPs has been used to analyze thousands of people to find out which SNPs are associated with specific diseases. The aim is to correlate an SNP-defined haplotype with a disease state. The amount of data is prodigious: 500,000 SNPs, thousands of people, thousands of medical records. With so much natural variation, statistical measures of association between a haplotype and a disease need to be very rigorous.
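One way to see why the statistics must be rigorous is to test a single SNP and then account for the fact that 500,000 SNPs are tested at once. The sketch below is an illustration, not the method used in any particular study: it applies a standard Pearson chi-square test to a 2 x 2 table of allele counts and a Bonferroni correction for multiple testing. The allele counts are hypothetical.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail probability for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the overall error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical allele counts: 4,000 alleles from cases, 4,000 from controls.
    chi2 = allele_chi_square(1200, 2800, 1000, 3000)
    p = p_value_1df(chi2)
    threshold = bonferroni_threshold(0.05, 500_000)
    print(f"chi-square = {chi2:.2f}, p = {p:.2e}, threshold = {threshold:.1e}")
    print("significant after correction:", p < threshold)
```

With these invented counts the SNP looks convincing on its own (p is far below 0.05), yet it fails the corrected threshold of 0.05 / 500,000. With half a million tests, many SNPs will show small p values by chance alone.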

These association tests have revealed particular haplotypes or alleles that are associated with modestly increased risks for such diseases as breast cancer, diabetes, arthritis, obesity, and coronary heart disease (Figure 17.15 and Table 17.5). Companies will now scan a human genome for these variations, and the price for this service keeps getting lower. However, at this point it is unclear what a person without symptoms should do with the information, since multiple genes, environmental influences, and epigenetic effects all contribute to the development of these diseases.

Of course, the best way to analyze a person's genome is by actually sequencing it. Until recently, this was prohibitively expensive. As we mentioned earlier, DNA pioneer James Watson's genome cost over $1 million, certainly too much for a typical person or insurance company to afford in the context of health care. But with advances in sequencing technologies the cost is falling rapidly. One new method automatically sequences protein-coding exons only, for example. Once the cost of genome sequencing is within an affordable range, SNP testing will be superseded.




17.14 Evolution of the Genome A comparison of the human and other genomes has revealed how genes with new functions have been added over the course of evolution. Each percentage number refers to genes in the human genome. Thus, 21 percent of human genes have homologs in prokaryotes and other eukaryotes, 32 percent of human genes occur only in other eukaryotes, and so on. (Figure labels, in order of increasing complexity: all prokaryotes and eukaryotes, 21%; eukaryotes only, 32%; vertebrates and animals only, 24%; vertebrates only, 22%. Functions shown: functions necessary for life, cell compartments, multicellularity, development, immune system, nervous system.)

17.15 SNP Genotyping and Disease Scanning the genomes of people with and without particular diseases reveals correlations between SNPs and complex diseases. (Figure labels: SNP profile in people with the disease; SNP profile in people without the disease. Each bar is a SNP. There are many thousands of SNPs in the human genome. Comparing the profiles reveals SNPs that correlate with disease.)



PHARMACOGENOMICS Genetic variation can affect how an individual responds to a particular drug. For example, a drug may be chemically modified in the liver to make it more or less active. Consider an enzyme that catalyzes the following reaction:

active drug → less active drug

A mutation in the gene that encodes this enzyme may make the enzyme less active. For a given dose of the drug, a person with the mutation would have more active drug in the bloodstream than a person without the mutation. So the effective dose of the drug would be lower in these people.

Now consider a different case, in which the liver enzyme is needed to make the drug active:

inactive drug → active drug

A person carrying a mutation in the gene encoding this liver enzyme would not be affected by the drug, since the activating enzyme is not present.

The study of how an individual's genome affects his or her response to drugs or other outside agents is called pharmacogenomics. This type of analysis makes it possible to predict whether a drug will be effective. The objective is to personalize drug treatment so that a physician can know in advance whether an individual will benefit from a particular drug (Figure 17.16). This approach might also be used to reduce the incidence of adverse drug reactions by identifying individuals that will metabolize a drug slowly, which can lead to a dangerously high level of the drug in the body.
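The effect of slow metabolism on drug levels can be sketched with a deliberately simple model. The assumptions here are ours, not the text's: the drug enters the body at a constant rate, it is removed at a rate proportional to the amount present, and the removal rate constant scales with enzyme activity. All numbers are hypothetical.

```python
def steady_state_level(dose_rate, elimination_rate):
    """Steady-state drug amount for constant input and first-order removal.

    At steady state, input equals removal: dose_rate = elimination_rate * level.
    """
    if elimination_rate <= 0:
        raise ValueError("elimination_rate must be positive")
    return dose_rate / elimination_rate


def adjusted_dose_rate(standard_dose_rate, relative_enzyme_activity):
    """Dose rate giving a slow metabolizer the same steady-state level."""
    return standard_dose_rate * relative_enzyme_activity


if __name__ == "__main__":
    standard_dose = 10.0    # mg per hour (hypothetical)
    normal_rate = 0.2       # per hour, typical enzyme activity (hypothetical)
    relative_activity = 0.25  # variant enzyme with 25% of normal activity

    slow_rate = normal_rate * relative_activity
    print("typical metabolizer:", steady_state_level(standard_dose, normal_rate))
    print("slow metabolizer:   ", steady_state_level(standard_dose, slow_rate))
    new_dose = adjusted_dose_rate(standard_dose, relative_activity)
    print("slow, adjusted dose:", steady_state_level(new_dose, slow_rate))
```

Under these assumptions a fourfold drop in enzyme activity gives a fourfold rise in the steady-state level at the same dose, which is the kind of outcome a genotype-guided dose adjustment aims to avoid.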

17.4 RECAP

The haploid human genome has 3.3 billion base pairs, but less than 2 percent of the genome codes for proteins. Most human genes are subject to alternative splicing; this may account for the fact that there are more proteins than genes. SNP mapping to find correlations with disease and drug susceptibility holds promise for personalized medicine.

What are some of the major characteristics of the human genome? See p. 380

How does SNP mapping work in personalized medicine? See pp. 381-382 and Figures 17.15 and 17.16

Genome sequencing has had great success in advancing biological understanding. High-throughput technologies are now being applied to other components of the cell: proteins and metabolites. We now turn to the results of these studies.

17.5 What Do the New Disciplines of Proteomics and Metabolomics Reveal?

"The human genome is the book of life." Statements like this were common at the time the human genome sequence was first revealed. They reflect "genetic determinism," the idea that a person's phenotype is determined by his or her genotype. But is an organism just a product of gene expression? We know that it is not. The proteins and small molecules present in any cell at a given point in time reflect not just gene expression but modifications by the intracellular and extracellular environment. Two new fields have emerged to complement genomics and take a more complete snapshot of a cell and organism: proteomics and metabolomics.

The proteome is more complex than the genome

As mentioned above, many genes encode more than a single protein (Figure 17.17A). Alternative splicing leads to different combinations of exons in the mature mRNAs transcribed from a single gene (see Figure 16.22). Posttranslational modifications also increase the number of proteins that can be derived from one gene (see Figure 14.22). The proteome is the sum total of the proteins produced by an organism, and it is more complex than its genome.
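A short enumeration shows how quickly alternative splicing multiplies the products of one gene. The sketch below is an idealization: it assumes each optional ("cassette") exon is independently kept or skipped while exon order is preserved. Real splicing is regulated and does not produce every combination. The exon names are hypothetical.

```python
from itertools import product


def splice_isoforms(exons, optional):
    """List the mRNAs formed when each optional exon is either kept or skipped.

    Exon order is preserved and constitutive exons are always included.
    """
    optional = set(optional)
    # Optional exons have two choices (keep, skip); the others have one.
    choices = [(True, False) if name in optional else (True,) for name in exons]
    isoforms = []
    for keep_flags in product(*choices):
        isoforms.append(tuple(e for e, keep in zip(exons, keep_flags) if keep))
    return isoforms


if __name__ == "__main__":
    gene = ["E1", "E2", "E3", "E4", "E5"]   # hypothetical five-exon gene
    for isoform in splice_isoforms(gene, optional=["E2", "E4"]):
        print("-".join(isoform))
```

With k optional exons this model yields 2^k mRNAs, so even a handful of optional exons per gene is enough for the proteome to outnumber the protein-coding genes.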




TABLE 17.5

SNP Human Genome Scans and Diseases

                          LOCATION OF SNP        % INCREASED RISK
DISEASE                   (CHROMOSOME NUMBER)    HETEROZYGOTES    HOMOZYGOTES

Breast cancer             8                      20               63
Coronary heart disease    9                      20               56
Heart attack              9                      25               64
Obesity                   16                     32               67
Diabetes                  10                     65               277
Prostate cancer           8                      26               58
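The percentages in Table 17.5 are relative increases, so what they mean for an individual depends on the baseline risk of the disease. The sketch below converts each percentage to a relative risk and applies it to a baseline. The table values come from Table 17.5; the 10 percent baseline is a hypothetical figure chosen only for illustration.

```python
# Percent increased risk from Table 17.5: (heterozygotes, homozygotes).
TABLE_17_5 = {
    "Breast cancer": (20, 63),
    "Coronary heart disease": (20, 56),
    "Heart attack": (25, 64),
    "Obesity": (32, 67),
    "Diabetes": (65, 277),
    "Prostate cancer": (26, 58),
}


def relative_risk(percent_increase):
    """Convert a percent increase in risk to a relative risk (1.0 = no change)."""
    return 1 + percent_increase / 100


def absolute_risk(baseline_risk, percent_increase):
    """Risk for a carrier, given the baseline risk in non-carriers."""
    return baseline_risk * relative_risk(percent_increase)


if __name__ == "__main__":
    baseline = 0.10  # hypothetical baseline risk, not a figure from the text
    for disease, (het, hom) in TABLE_17_5.items():
        print(f"{disease:24s} heterozygote {absolute_risk(baseline, het):.3f}  "
              f"homozygote {absolute_risk(baseline, hom):.3f}")
```

A 20 percent increase on a 10 percent baseline raises the risk only to 12 percent, which is why most of these associations are described as modest.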

17.16 Pharmacogenomics Correlations between genotypes and responses to drugs will help physicians develop personalized medical care. (Figure labels: All patients with the same diagnosis. These patients have the genes for an effective response to the drug. These patients either do not respond to the drug or suffer side effects. They need an alternative drug or dose.)



Two methods are commonly used to analyze the proteome:

Because of their unique amino acid compositions (primary structures), most proteins have unique combinations of electric charge and size. On the basis of these two properties, they can be separated by two-dimensional gel electrophoresis. Thus isolated, individual proteins can be analyzed, sequenced, and studied (Figure 17.17B).

Mass spectrometry uses electromagnets to identify proteins by the masses of their atoms and displays them as peaks on a graph.
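The logic of separating proteins in two dimensions can be sketched by assigning each protein a rough charge and a rough size. The approximations below are ours and quite crude: net charge counts only lysine and arginine as +1 and aspartate and glutamate as -1, and mass uses a rule-of-thumb average of about 110 daltons per residue. Real gels separate by isoelectric point and measured mobility. The three peptide sequences are invented.

```python
AVERAGE_RESIDUE_MASS_DA = 110  # rule-of-thumb average; an approximation


def approximate_net_charge(sequence):
    """Net charge near neutral pH, counting only K/R (+1) and D/E (-1)."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative


def approximate_mass_da(sequence):
    """Approximate mass from chain length alone."""
    return len(sequence) * AVERAGE_RESIDUE_MASS_DA


def gel_coordinates(proteins):
    """Map each protein name to its (charge, mass) position on an idealized gel."""
    return {
        name: (approximate_net_charge(seq), approximate_mass_da(seq))
        for name, seq in proteins.items()
    }


if __name__ == "__main__":
    # Hypothetical peptides chosen so that one property alone cannot separate them.
    sample = {"pepA": "MKKRLA", "pepB": "MDEELA", "pepC": "MKKRLAGGSA"}
    for name, (charge, mass) in gel_coordinates(sample).items():
        print(f"{name}: charge {charge:+d}, mass about {mass} Da")
```

Here pepA and pepB have the same size but opposite charge, while pepA and pepC have the same charge but different sizes. Neither separation alone resolves all three, but the two together do.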

The ultimate aim of proteomics is just as ambitious as that of genomics. While genomics seeks to describe the genome and its expression, proteomics seeks to identify and characterize all of the expressed proteins.

Comparisons of the proteomes of humans and other eukaryotic organisms have revealed a common set of proteins that can be categorized into groups with similar amino acid sequences and similar functions. Forty-six percent of the yeast proteome, 43 percent of the worm proteome, and 61 percent of the fly proteome are shared by the human proteome. Functional analyses indicate that this set of 1,300 proteins provides the basic metabolic functions of a eukaryotic cell, such as glycolysis, the citric acid cycle, membrane transport, protein synthesis, DNA replication, and so on (Figure 17.18).

Of course, these are not the only human proteins. There are many more, which presumably distinguish us as human eukaryotic organisms. As we have mentioned before, proteins have different functional regions called domains (for example, a domain for binding a substrate, or a domain for spanning a membrane). While a particular organism may have many unique proteins, those proteins are often just unique combinations of domains that exist in other organisms. This genetic reshuffling is a key to evolution.

Metabolomics is the study of chemical phenotype

Studying genes and proteins gives a limited picture of what is going on in a cell. But as we have seen, both gene function and protein function are affected by the internal and external environments of the cell. Many proteins are enzymes, and their activities affect the concentrations of their substrates and products. So as the proteome changes, so will the abundances of these often-small molecules, called metabolites. The metabolome is the quantitative description of all of the small molecules in a cell or organism. These include:

Primary metabolites involved in normal processes, such as intermediates in pathways like glycolysis. This category also includes hormones and other signaling molecules.

Secondary metabolites, which are often unique to particular organisms or groups of organisms. They are often involved in special responses to the environment. Examples are antibiotics made by microbes, and the many chemicals made by plants that are used in defense against pathogens and herbivores.

Not surprisingly, measuring metabolites involves sophisticated analytical instruments. If you have studied organic or analytical chemistry, you may be familiar with gas chromatography and high-performance liquid chromatography, which separate molecules, and mass spectrometry and nuclear magnetic resonance spectroscopy, which are used to identify them. These measurements result in "chemical snapshots" of cells or organisms, which can be related to physiological states.

There has been some progress in defining the metabolome. A database created by David Wishart a




17.17 Proteomics (A) A single gene can code for multiple proteins. (B) A cell's proteins can be separated on the basis of charge and size by two-dimensional gel electrophoresis. The two separations can distinguish most proteins from one another. (Figure labels, A: 1. Alternative splicing can produce different mRNAs 2. that get translated into different proteins. 3. Posttranslational modifications of proteins result in different structures and functions. Figure labels, B: first separation (charge); second separation (size). This gel separates hundreds of proteins in two dimensions. A protein can be isolated, sequenced, and studied.)

17.18 Proteins of the Eukaryotic Proteome About 1,300 proteins are common to all eukaryotes and fall into these categories. Although their amino acid sequences may differ to a limited extent, they perform the same essential functions in all eukaryotes. (Figure categories: transcription and translation; metabolism; transport; replication; protein folding; miscellaneous.)






17.1 How Are Genomes Sequenced?

The sequencing of genomes required the development of ways to cut large chromosomes into fragments, sequence each of the fragments, and then line them up on the chromosome. Review Figure 17.1, ANIMATED TUTORIAL 17.1

Hierarchical sequencing involves mapping the genome with genetic markers, cutting the genome into smaller pieces and sequencing them, then lining up the sequences using the markers.

Shotgun sequencing involves directly cutting the genome into overlapping fragments, sequencing them, and using a computer to line up the sequences.

DNA sequencing technologies involve labeled nucleotides that terminate the growing polynucleotide chain. Review Figure 17.2

Rapid, automated methods for high-throughput sequencing are being developed. Review Figure 17.3, ANIMATED TUTORIAL 17.2

17.2 What Have We Learned from Sequencing Prokaryotic Genomes?

DNA sequencing is used to study the genomes of prokaryotes that are important to humans and ecosystems.

Functional genomics aims to determine the functions of gene products. Comparative genomics involves comparisons of genes and genomes from different organisms to identify common features and functions.

Transposable elements and transposons can move about the genome. Review Figure 17.6

Metagenomics is the identification of DNA sequences without first isolating, growing, and identifying the organisms present in an environmental sample. Many of these sequences are from prokaryotes that were heretofore unknown to biologists. Review Figure 17.7

Transposon mutagenesis can be used to inactivate genes one by one. Then the organism can be tested for survival. In this way, a minimal genome of less than 350 genes was identified for the bacterium Mycoplasma genitalium. Review Figure 17.8

17.3 What Have We Learned from Sequencing Eukaryotic Genomes?

Genome sequences from model organisms have demonstrated some common features of the eukaryotic genome. In addition, there are specialized genes for cellular compartmentation, development, and features unique to plants. Review Tables 17.1-17.4 and Figures 17.9 and 17.10

Some eukaryotic genes exist as members of gene families. Proteins may be made from these closely related genes at different times and in different tissues. Some members of gene families may be nonfunctional pseudogenes.

Repeated sequences are present in the eukaryotic genome.

Moderately repeated sequences include those coding for rRNA. Review Figure 17.12

17.4 What Are the Characteristics of the Human Genome?

The haploid human genome has 3.3 billion base pairs.

Only 2 percent of the genome codes for proteins; the rest consists of repeated sequences and noncoding DNA.

Virtually all human genes have introns, and alternative splicing leads to the production of more than one protein per gene.

SNP genotyping correlates variations in the genome with diseases or drug sensitivity. It may lead to personalized medicine. Review Figure 17.15

Pharmacogenomics is the analysis of genetics as applied to drug metabolism.

17.5 What Do the New Disciplines of Proteomics and Metabolomics Reveal?

The proteome is the total protein content of an organism.

There are more proteins than protein-coding genes in the genome.

The proteome can be analyzed using chemical methods that separate and identify proteins. These include two-dimensional electrophoresis and mass spectrometry. See Figure 17.17

The metabolome is the total content of small molecules, such as intermediates in metabolism, hormones, and secondary metabolites.

SEE WEB ACTIVITY 17.1 for a concept review of this chapter.
