Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010


Eukaryotic genomes

Genomics 260.605.01
J. Pevsner

November 17, 2010


Problems from Chapter 16: Human α-satellite DNA at centromeres

Approach: blastn X07685…

…vs. human genomic + transcript

…vs. nucleotide collection

…vs. nucleotide collection with organism set to human and the exclude option selected


Giardia

Trypanosoma


Eukaryotes (Baldauf et al. 2000)

Page 730


Outline for today


Outline: Chapter 18

Protozoans (Giardia lamblia)
Pathogens: Trypanosomes and Leishmania
Chromalveolates: Plasmodium, Paramecium
Plants: Green algae, Arabidopsis, rice, poplar
Slime mold Dictyostelium
Metazoans

Nematode C. elegans
Insects: Drosophila, Anopheles gambiae
Sea urchin
750 MYA: Ciona
450 MYA: fish
310 MYA: dinosaurs and chicken
180 MYA: opossum
100 MYA: mammalian radiation
80 MYA: rodents
5-50 MYA: primates

Page 729


Trichomonas
Giardia


Parabasala

It was previously postulated that the Parabasala lack mitochondria; they are deep-branching eukaryotes that may have separated before the alpha-proteobacterial endosymbiont entered the eukaryotic lineage. Recent studies suggest the Parabasala do indeed have mitochondrial genes (or their remnants).

Page 732


Trichomonas vaginalis

• Flagellated protist, single-celled organism

• 170 million cases world-wide

• 160 Mb genome

• 62% repetitive DNA (many repeats of viral origin such as transposons and retrotransposons)

• 60,000 predicted protein-coding genes (one of the highest known totals)

• Massive expansion of selected families such as protein kinases (n=927) and the BspA-like gene family (n=658; involved in host cell adherence and aggregation)

• 152 genes acquired by lateral gene transfer, presumably from intestinal bacteria

Page 732


Giardia lamblia

• Also called Giardia intestinalis
• Water-borne parasite, lives in intestines of mammals and birds
• Causes giardiasis (source of nonbacterial diarrhea)
• Lacks both mitochondria and peroxisomes
• Genome 11.7 Mb
• Each cell has two nuclei, each with five chromosomes (range 0.7 to 3 megabases)
• ~6,500 ORFs (~1,800 overlapping genes)
• Only three retrotransposon families
• Only four introns

Page 733


Trypanosoma
Leishmania


Trypanosoma

• Protozoan genus with 20 species
• Trypanosoma brucei causes sleeping sickness
• Trypanosoma cruzi causes Chagas’ disease (S. America); infects 16-18 million people
• Tsetse flies transmit trypanosomes to humans
• 26 Mb genome (highly variable size; 25% variation)
• variable number of intermediate chromosomes
• ~100 linear minichromosomes
• ~9000 predicted genes (900 pseudogenes)
• kinetoplast DNA: circular rings of mitochondrial DNA (resembles medieval armor)
• Host immune system evasion via variant surface glycoprotein (VSG) homodimers; >1000 VSG genes but only one is expressed at a time, and only ~7% encode functional proteins

Page 735


Trypanosome computer problem: analyze protein AAC32813 (use BLAST, Pfam)


Leishmania

• ~20 Leishmania species cause leishmaniasis

• 2 million cases per year

• L. major genome ~34 Mb, 36 chromosomes

• On chromosome 1 (smallest chromosome) the first 29 genes are all transcribed from the same DNA strand, and the next 50 genes are all transcribed from the opposite strand (unprecedented in eukaryotes)

• ~8200 predicted ORFs

Page 736


Leishmania computer problem: analyze a repetitive DNA (use EuPathDB)


Chromalveolates
1. Apicomplexans
2. Dinoflagellates
3. Ciliates (Paramecium, Tetrahymena)
4. Heterokonts
5. Haptophytes
6. Cryptomonads


Individual eukaryotic genomes: malaria parasite Plasmodium falciparum

Plasmodium falciparum causes malaria, killing 2.7 million people each year.

Distinguishing features: Four Plasmodium species infect humans: P. falciparum, P. vivax, P. ovale, P. malariae. The life cycle is extremely complex.

Genome size: 22.8 Mb
Chromosomes: 14 (range 0.6 to 3.3 Mb)
Genes: 5268 (comparable to S. pombe) (1 gene/4300 bp)
Website: http://www.plasmodb.org

P. falciparum has an adenine+thymine (AT) content of 80.6%.
The P. yoelii yoelii genome was also sequenced (infects rats).
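
The AT content quoted for P. falciparum is a simple base-composition statistic. A minimal Python sketch of how such a value can be computed is shown here; the function name `at_content` and the toy sequence are illustrative assumptions, not part of the lecture or of any PlasmoDB tool.

```python
def at_content(sequence: str) -> float:
    """Return the percentage of A and T among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * (counts["A"] + counts["T"]) / total


# Made-up 20-base example (hypothetical, not a real Plasmodium sequence).
toy = "ATATTTAAGCATATTAAATT"
print(f"AT content: {at_content(toy):.1f}%")
```

Ambiguous symbols such as N are ignored in this sketch, which is one of several reasonable conventions.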

Page 738


Individual eukaryotic genomes: malaria parasite Plasmodium falciparum

Bioinformatics approaches to Plasmodium falciparum:

-- The apicoplast (relic plastid; fatty acid, isoprene metabolism) is a potential drug target. Apicoplast signal sequences found.
-- Comparative genomics defines some gene functions, identifies genes lacking in closely related species
-- Genes implicated in antigenic variation and immune system evasion can be identified (e.g. 1000 copies of vir)
-- Proteomics applied to four stages of the life cycle (sporozoites, merozoites, trophozoites, gametocytes)
-- Atypical metabolic pathways may be exploited, e.g. use of 1-deoxy-D-xylulose 5-phosphate (DOXP) in isoprene biosynthesis.

Page 738


Paramecium
Tetrahymena


Paramecium tetraurelia

• Unicellular
• Ciliate (characterized by having vibrating cilia for locomotion and food capture, and two nuclei with separate germline and somatic functions [nuclear dimorphism])
• Diploid germinal micronucleus undergoes meiosis
• Polyploid somatic macronucleus responsible for gene expression; lost with each generation. Amplifies to ~800 copies then undergoes rearrangement and DNA elimination
• Genome sequenced (2006): 72 Mb coverage of macronucleus. 188 scaffolds may represent chromosomes (have telomeric repeats)
• 40,000 protein-coding genes (~double human number)
• Three whole-genome duplications

Page 742


Page 743


all chromosome-sized scaffolds (reciprocal best hit pairs)

Aury et al. Page 744

reconstructed ancestral sequences
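
The figure labels mention reciprocal best hit pairs. As a rough illustration of that idea only, the Python sketch below pairs two sequences when each is the other's top-scoring match. The input format (a dictionary of pairwise scores), the function names, and the toy scores are assumptions made for this example; they are not taken from Aury et al.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores between genes on two scaffolds.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90, ("a3", "b1"): 120}
b_vs_a = {("b1", "a1"): 198, ("b1", "a3"): 119, ("b2", "a1"): 151, ("b2", "a2"): 95}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data only a1 and b1 choose each other, so a2 and a3 are left unpaired even though they have hits.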


Tetrahymena

• Longstanding model organism
• 104 Mb genome
• ~225 chromosomes
• ploidy ~45
• No evidence for segmental or whole-genome duplications

Page 743


Stramenopila


Diatom

• Single-celled algae
• Thalassiosira pseudonana (diatom) has three genomes:

[1] diploid nuclear genome of 34.5 Mb in 24 pairs of chromosomes

[2] plastid genome acquired by endosymbiosis ~1.3 BYA. This arrived when a nonphotosynthetic diatom ancestor engulfed a photosynthetic eukaryote (a red algal endosymbiont)

[3] mitochondrial genome

Page 746


Plants


Individual eukaryotic genomes: overview of plants

• Plants form a distinct clade in the eukaryotic tree
• All plants are multicellular
• Plants are sessile, and depend on photosynthesis (Epifagus is an exception)
• Plants originated about 1.5 billion years ago (BYA), after eukaryotes had acquired a mitochondrion by endosymbiosis. Plants acquired a plastid (i.e. the chloroplast) over 1 BYA.

Page 748


Figure 18.11
Page 749

After Meyerowitz (2002) and Wang et al. (1999)


Individual eukaryotic genomes: overview of plants

Eudicots (e.g. Arabidopsis) diverged from monocots (e.g. rice) about 200 million years ago (MYA).

Dicots include rosids (Arabidopsis, Glycine max [soybean], M. truncatula) and asterids (e.g. Lycopersicon esculentum [tomato]).

Monocots include cereals (seeds of flowering plants from the grass family).

Page 748


Page 749


Individual eukaryotic genomes: Arabidopsis thaliana

A. thaliana is a thale cress, sometimes called a weed.
Distinguishing features: Rapid growth rate, extensive genetics.

Member of the Brassicaceae (mustard) family.
A flowering plant (emerged 200 MYA).

Genome size: 125 Mb (very small for a plant genome). Wheat is 16.5 Gb, barley is 5 Gb.
Chromosomes: 5
Genes: 25,498 (comparable to human)
Website: http://www.arabidopsis.org

-- The entire Arabidopsis genome may have duplicated twice.
-- 24 duplicated segments of > 100 kilobases

Page 751


The TAIR web browser for Arabidopsis


Second plant genome: rice

Oryza sativa is rice (subspecies indica, japonica).
Distinguishing features: This crop is a staple for half the world’s population. Four groups generated draft versions.
Genome size: 430 Mb (1/8th of human genome). One of the smallest grass genomes.
Chromosomes: 12
Genes: about 50,000? (more than human)
Website: http://www.usricegenome.org (and other sites)

--The rice genome displays an unusual gradient in GC content. The mean is 43%. The 5’ end of most genes has a higher GC content than the 3’ end (by 25%). GC-rich regions occur selectively in exons (not introns).
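
One way to look for a 5' to 3' GC gradient of the kind described for rice is to compute GC content in consecutive windows along each gene. The Python sketch below does this for a single sequence; the function names, window size, and toy sequence are assumptions for illustration, and the published rice analyses may have used different windows and normalization.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C among unambiguous bases; 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def gc_windows(seq: str, window: int, step: int) -> list:
    """GC percentage of each full-length window, scanning from the 5' end."""
    return [gc_percent(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


# Hypothetical 20-base "gene": GC-rich 5' half, AT-rich 3' half.
toy_gene = "GCGCGCGCATATATATGCAT"
profile = gc_windows(toy_gene, window=10, step=10)
print(profile)                    # GC% per window, 5' to 3'
print(profile[0] - profile[-1])   # difference between first and last window
```

A positive difference between the first and last window indicates a higher GC content at the 5' end of that sequence.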

Page 753


Algae computer problem: TaxPlot


Slime mold


Individual eukaryotic genomes: the slime mold Dictyostelium discoideum

Dictyostelium discoideum is a slime mold. This forms an outgroup to the metazoans.

Distinguishing features: The remarkable life cycle includes single-cell and multicellular forms.

Genome size: 34 Mb
Chromosomes: 6
Genes: about 11,000
Website: http://dictybase.org

--The Dicty genome has almost 80% AT content (similar to Plasmodium). Thus a whole-chromosome shotgun strategy was employed.

Page 756


Individual eukaryotic genomes: overview of the metazoans

The metazoans are animals including worms, insects, and vertebrates (e.g. fish and primates).

Page 758


Nematodes (Al Scott)


Individual eukaryotic genomes: the nematode C. elegans

C. elegans is a free-living soil nematode.
Distinguishing features: Its genome was the first of a multicellular animal to be sequenced (1998).
Genome size: 97 Mb
Chromosomes: 6
Genes: about 19,000 (spanning 27% of genome)
Website: http://www.wormbase.org

--Many worm functional genomics projects have been performed, such as microarrays at multiple developmental stages.

Page 759


Insects


Individual eukaryotic genomes: the fruitfly Drosophila

Drosophila’s distinguishing features: Short lifecycle, varied phenotypes, model organism in genetics.
Genome size: 180 Mb
Chromosomes: 5
Genes: about 13,000 (spanning 27% of genome)
Website: http://www.fruitfly.org

-- At the time, largest genome for which whole genome shotgun sequencing was applied.
-- Each genome annotation improves the gene models

Page 761


This is Ann: the mosquito Anopheles gambiae

A. gambiae was the second insect genome sequenced.
Distinguishing features: It is the malaria parasite vector.
Genome size: 278 Mb (twice the size of Drosophila)
Chromosomes: 3
Genes: about 14,000
Website: http://www.ensembl.org/Anopheles_gambiae/

-- Diverged from Drosophila 250 MYA (average amino acid sequence identity of orthologs is 56%). Compare human and pufferfish (diverged 400 MYA, 61% identity): insect proteins diverge at a faster rate.
-- High degree of genetic variation
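
The ortholog identity figures quoted for mosquito versus fruit fly are averages of per-protein percent identity. A minimal Python sketch of the per-pair calculation follows, assuming the two proteins are already aligned with "-" as the gap character. The function name and toy alignment are hypothetical, and skipping gapped columns is only one of several conventions for the denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over alignment columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (an assumption of this sketch)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Made-up aligned fragments, not real Anopheles or Drosophila proteins.
print(round(percent_identity("MKT-LLVA", "MRTALLIA"), 1))
```

Averaging this value over all ortholog pairs between two species gives a summary figure of the kind quoted on the slide.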

Page 764


Vertebrates


750 Million Years Ago: the sea squirt Ciona intestinalis

and the Road to Vertebrates

The chordates include vertebrates (fish, amphibians, reptiles, birds, mammals) which have a spinal column.

Some chordates are invertebrates, such as the sea squirt.

Genome size: 160 Mb (20 times smaller than human)
Chromosomes: 14
Genes: 15,852

Significant for our understanding of vertebrate evolution.

Page 767


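Figure: vertebrate phylogeny with divergence times. Taxa: zebrafish, T. nigroviridis, human, dog, rat, mouse, chicken. Clades: teleost fish, tetrapods, amniotes, mammals. Ancestral nodes: euteleost ancestor, tetrapod ancestor, amniote ancestor. Time labels: 450, 370, 310, 140, 87, and 16 MYA.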


Sea Urchin Genome Sequencing Consortium et al., Science 314, 941 -952 (2006)

The phylogenetic position of the sea urchin relative to other model systems and humans


Figure labels: metazoa (animals), bilateria, protostomia (ecdysozoa, lophotrochozoa), deuterostomia (hemichordata, echinodermata, chordata), urochordata, cephalochordata, vertebrata. Taxa shown: fruit fly, nematode, mollusc, annelid, sea urchin, ascidian, mouse, human.


Ciona intestinalis


Ciona intestinalis (sea squirt)

• The chordates appeared at the time of the Cambrian explosion (~550 MYA).

• 160 Mb genome; 14 chromosomes plus mitochondrial. This is ~20 times smaller than the human genome.

• The Ciona genome encodes ~16,000 protein-coding genes (similar to other invertebrate genomes). 15,852 distinct gene models were obtained.

• Lineage-specific innovations e.g. genes involved in cellulose metabolism.

• High level of allelic polymorphism detected: 1.2% of the nucleotides differed between alleles (~15-fold more than in human)

Page 767


Phylogeny of bilaterian animals (1-3). Ciona intestinalis is a member of the urochordates, the most primitively branching clade of chordates. Chordates in turn are deuterostomes, one of two great divisions (along with protostomes) of bilaterian animals.


There are two recurring themes in our analysis of the ascidian genome and its relationship to the vertebrate genome. First, we find repeatedly that a family or subfamily of vertebrate genes has only a single representative in Ciona. The implication is that the Ciona gene content in these families corresponds to the complement of the ancestral chordate. Second, there were notable cases of gene families with multiple members in both Ciona and vertebrates that could not be placed in easy correspondence with one another. In these cases, the gene content of the ancestral chordate was apparently mobilized and diversified independently in the two lineages, providing examples of macroevolutionary change within the different branches of the chordate phylum.

The streamlined nature of the Ciona genome should have an enormous impact on unraveling complex developmental processes in vertebrates. For example, information about the function of the major vertebrate FGF subfamilies can be obtained in Ciona through the use of simple gene-disruption methods (e.g., morpholinos), unhampered by the complications associated with functional redundancies often encountered in large gene families.


Further analysis of the Ciona genome will also provide considerable information about the regulation of vertebrate genes. The Ciona cDNA projects have characterized transcripts for more than three-quarters of the genes. Systematic in situ hybridization assays are being performed to determine the spatial and temporal expression of each gene during embryonic and larval development. Thus far, there is information for nearly one-third of the genes in the Ciona genome (http://ghost.zool.kyoto-u.ac.jp/indexr1.html). Finally, all of this information--the gene-disruption data and gene-expression profiles--will be compiled along with large-scale screens for cis-regulatory DNAs to determine complete gene-regulation networks underlying the development of basic chordate features such as the neural tube and notochord.

Sea Urchin Genome Sequencing Consortium et al., Science 314, 941 -952 (2006)


450 Million Years Ago: the fish Fugu rubripes

Fugu is a pufferfish (also called Takifugu rubripes).
Distinguishing features: Diverged from humans 450 MYA; has comparable number of genes in a compact genome.
Genome size: 365 Mb (1/10th human genome)
Genes: about 30,000
Website: http://genome.jgi-psf.org/fugu6/fugu6.info.html

-- Only 2.7% of genome is interspersed repeats (compare 45% in human), based on RepeatMasker.
-- Introns are relatively short. 75% of Fugu introns are <425 base pairs (for human, 75% are <2609 base pairs).
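
The intron statement is a claim about the fraction of introns shorter than a cutoff. A short Python sketch of that calculation is given here; the function name and the intron lengths are invented for illustration and are not Fugu data.

```python
def fraction_below(lengths, threshold):
    """Fraction of values strictly smaller than the threshold."""
    if not lengths:
        raise ValueError("need at least one length")
    return sum(1 for n in lengths if n < threshold) / len(lengths)


# Hypothetical intron lengths in base pairs.
intron_lengths = [80, 120, 300, 424, 425, 1000, 90, 2000]
share = fraction_below(intron_lengths, 425)
print(f"{share:.1%} of introns are shorter than 425 bp")
```

Running the same function on intron lengths from two genomes, each with its own cutoff, reproduces the style of comparison made on the slide.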

Page 768


450 Million Years Ago: sequenced fish genomes

[1] Fugu (also called Takifugu rubripes); second vertebrate genome to be sequenced after human

[2] Pufferfish Tetraodon nigroviridis

[3] Medaka Oryzias latipes

[4] Zebrafish Danio rerio

[5] Elephant shark Callorhinchus milii

Page 770


Fugu rubripes
Takifugu rubripes

Japanese pufferfish


Fugu rubripes (Takifugu rubripes): pufferfish

• First vertebrate genome completed after human

• 365 Mb genome (~one eighth the size of human)

• Repetitive DNA accounts for less than 1/6th of the genome. There is also a reduction in the size of introns.

• Last common ancestor of teleost fish and humans lived 450 MYA.

• Whole-genome shotgun sequencing strategy. Error rate: five errors per 10,000 nucleotides (Phrap quality score of 33 = -10 log10(5 x 10^-4)).

• Key website: www.fugu-sg.org

Page 770
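
The quality score on the Fugu slide follows the standard Phred-style relation Q = -10 log10(p), where p is the per-base error probability. A small Python sketch of the conversion in both directions is shown here; the function names are chosen for this example only.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a per-base error probability into a Phred-scaled quality score."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10 ** (-quality / 10)


# Five errors per 10,000 nucleotides, as stated on the slide.
print(round(phred_quality(5 / 10_000)))
# A quality of 30 corresponds to one error per 1,000 bases.
print(error_probability(30))
```

Each 10-point increase in quality corresponds to a tenfold drop in the expected error rate.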


http://www.fugu-sg.org/


Figure: GC content of Fugu and human. x-axis: GC content (%); y-axis: % of segments; series: Fugu, human.


Figure: Distribution of ratios for gene locus sizes of Fugu-human orthologous pairs. x-axis: ratio of Fugu to human gene locus sizes; y-axis: number of orthologous gene pairs.


Tetraodon nigroviridis
freshwater pufferfish

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=12349


Tetraodon nigroviridis (teleost fish)

• Smallest known vertebrate genome

• 21 chromosomes. Ancestral bony vertebrate genome (Osteichthyes) had 12 chromosomes.

• Diverged from Takifugu 18-30 MYA

• Transposable elements are rare (only 4000 copies) but diverse (73 different types). Human and mouse genomes have millions of transposable element copies, but only ~20 types.

• 27,918 gene models (based on cDNAs; similarities to human, mouse and Takifugu; and ab initio prediction)

• Source: Jaillon O et al. Nature (2004) 431:946.

Page 770



Tetraodon Red = 5S and 28S ribosomal RNA gene arrays


Figure: Distribution of percent identity between pairs of orthologous proteins. x-axis: identity (%); y-axis: percentage of proteins.


Comparisons were performed with 2,289 proteins that are orthologous between the chordate C. intestinalis and all four vertebrates—Tetraodon, Takifugu, human and mouse (asterisks)—and with 5,802 proteins orthologous between all four vertebrates only, between fish and mammals (triangles) or between the two fish (circles), and between the two mammals (squares). As expected, all vertebrates show the same distribution profile compared to Ciona and both fish show the same distribution profile compared to mammals. Surprisingly, the distribution profile of the comparison between the two fish and between the two mammals is also very similar, despite the much shorter evolutionary time since the tetraodontiform radiation.


Tetraodon underwent whole genome duplication

• WGD in the ray-finned fish lineage had been controversial. After WGD, the resulting polyploid (tetraploid) returns to a diploid state through gene deletion, with few duplicated copies retained. There are two lines of evidence indicating WGD occurred:

[1] Duplicated genes (paralogs) occur on “paralogous chromosomes” (i.e. interchromosomally). 1078 pairs of duplicated genes were identified in Tetraodon and 995 pairs in Takifugu. Based on Ks frequencies, 75% are ancient duplications (occurring before the Tetraodon-Takifugu speciation).

[2] Comparison with a related species that did not undergo WGD (human) confirms WGD. The same strategy was used with Kluyveromyces waltii to demonstrate the S. cerevisiae WGD.

Page 770


a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or higher than 0.35 substitutions per site since the divergence between the two puffer fish (arrows).
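
The two categories in panel a can be separated with a simple threshold on Ks. The Python sketch below splits duplicated gene pairs at Ks = 0.35, the value given in the caption. The function name, the tuple layout, and the Ks values are made up for illustration, and assigning a pair with Ks exactly equal to the threshold to the "recent" class is an arbitrary choice of this sketch.

```python
def split_by_ks(pairs, threshold=0.35):
    """Split (gene_a, gene_b, ks) tuples into recent and ancient duplicates.

    Pairs with ks > threshold are labeled ancient; all others are labeled recent.
    """
    result = {"recent": [], "ancient": []}
    for gene_a, gene_b, ks in pairs:
        label = "ancient" if ks > threshold else "recent"
        result[label].append((gene_a, gene_b, ks))
    return result


# Hypothetical duplicated gene pairs with synonymous substitution rates (Ks).
pairs = [("g1", "g2", 0.05), ("g3", "g4", 0.90), ("g5", "g6", 1.40), ("g7", "g8", 0.20)]
groups = split_by_ks(pairs)
ancient_fraction = len(groups["ancient"]) / len(pairs)
print(f"ancient: {len(groups['ancient'])}, recent: {len(groups['recent'])}, "
      f"ancient fraction: {ancient_fraction:.0%}")
```

Applied to a real set of paralog pairs, the ancient fraction is the quantity summarized on the WGD slide as the share of duplications predating the Tetraodon-Takifugu split.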


b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes.


Page 771


310 Million years ago: Gallus gallus: the chicken

• The chicken genome diverged from the human genome 310 MYA. It is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced.

• Chicken genome size is ~1.2 Gb

• Chicken is a model for ~9600 extant avian species

• 38 pairs of autosomes, one pair of sex chromosomes (female ZW, male ZZ); 2n = 78.

• 571 ncRNA genes (from 20 distinct families). Estimated 20,000 to 23,000 protein-coding genes.

Source: International Chicken Genome Sequencing Consortium, Nature 2004 432:695

Page 771


The horizontal axis represents estimated relative species diversity. The Archosauria include the Aves, their Mesozoic dinosaur predecessors, and Crocodilia; the Lepidosauria (lizards, snakes and tuataras) are not indicated. Archaeopteryx (indicated by an asterisk) is considered to be the first known bird and lived approximately 150 Myr ago.


Figure labels: vertebrata, amniota, diapsida. Taxa: chicken, human, mouse, Fugu. Time axis: Cenozoic, Mesozoic, Paleozoic.


• The nearly threefold difference in size between the chicken and mammalian genomes reflects a substantial reduction in interspersed repeat content, pseudogenes and segmental duplications within the chicken genome.

• Chicken−human aligned segments tend to occur in long blocks of conserved synteny. We find a relatively low rate of chromosome translocations in both lineages from the last common ancestor, whereas intrachromosomal rearrangements (for example, inversions) are more common.

• Syntenic relationships for certain classes of non-coding RNA (ncRNA) genes differ from those of protein-coding genes. This observation implies a novel mode of evolution for some ncRNA genes.

• Expansion and contraction of multigene families seem to have been major factors in the independent evolution of mammals and birds.


• The sizes of chicken chromosomes, which span a range of nearly two orders of magnitude, correlate negatively with recombination rate, G+C and CpG content, and gene density but positively with repeat density.

• Synonymous substitution rates are elevated for genes in both chicken microchromosomes and in subtelomeric regions of macrochromosomes.

• There is a paucity of retroposed pseudogenes in the chicken genome, in contrast to mammalian genomes, greatly simplifying the classification of chicken gene content. This is explained by the high specificity of the reverse transcriptase from the predominant interspersed repeat element in the chicken genome: the CR1 long interspersed nucleotide element (LINE).

• Unlike all other vertebrate genomes studied so far, no short interspersed nucleotide elements (SINEs) have been active in the chicken genome for the last 50 Myr.


• Alignment of the chicken and human genomes identifies at least 70 megabases (Mb) of sequence that is highly likely to be functional in both species.

• Many of the chicken−human aligned, non-coding sequences occur far from genes, frequently in clusters that seem to be under selection for functions that are not yet understood.


Families of ncRNA genes in the chicken genome


Sensitivity of gene prediction in the chicken genome


Conservation near human genes in the chicken genome


Changes to the genome content of chicken and human


Family expansion in the chicken genome


The short arm of chicken chromosome 4 retains ancestral microchromosome properties


Chicken genome synonymous substitution rates


The reference structure was taken from human or mouse, and only those with cDNA-based definitions of the structure were used. The central figure shows an idealized gene structure, with the grey exons representing coding sequence and white boxes representing 3' and 5' untranslated regions.


Chicken genes classified according to their predicted evolutionary relationships with genes of two other model vertebrates (Fugu and human).


Forty-three per cent of the chicken genes are present in 1:1:1 orthology relationships for the three species. Also present in three species are n:n:n (many:many:many) orthologues; putative gene duplication events have resulted in multiple genes in at least one of the species. Pairwise orthologues are assigned when orthology is not detectable in the third species. Between Fugu and chicken, pairwise orthologues are rare (as expected), and might be indicative of gene loss in the lineage leading to humans. For a substantial number of genes, clear orthology relations cannot be described at all, but some similarity to genes in the other species remains detectable ('Homology', E-value cutoff is 10^-6 in Smith−Waterman searches at the protein level).
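
The orthology categories in this caption can be expressed as a rule on how many genes each species contributes to an orthologous group. The Python sketch below is a simplified illustration of that rule, assuming group membership has already been determined. The function name, the "species-specific" label, and the example counts are assumptions, and the similarity-only 'Homology' class is not modeled.

```python
SPECIES = ("chicken", "human", "Fugu")


def classify_group(counts):
    """Classify an orthologous group from per-species gene counts.

    counts: dict mapping each name in SPECIES to the number of genes in the group.
    """
    present = [s for s in SPECIES if counts.get(s, 0) > 0]
    if len(present) == 3:
        if all(counts[s] == 1 for s in present):
            return "1:1:1"
        return "n:n:n"  # duplication in at least one species
    if len(present) == 2:
        return "pairwise (" + "-".join(present) + ")"
    return "species-specific"  # hypothetical label for groups seen in one species


# Hypothetical orthologous groups.
print(classify_group({"chicken": 1, "human": 1, "Fugu": 1}))
print(classify_group({"chicken": 1, "human": 3, "Fugu": 2}))
print(classify_group({"chicken": 1, "human": 0, "Fugu": 1}))
```

Counting how many groups fall into each class gives a breakdown comparable in form to the one described in the caption.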


Sequence identity of orthologues.

a, The percentage amino acid identity distribution of 1:1 orthologues between human and chicken, between chicken and Fugu, and between human−chicken orthologues that are also present in 1:1 relationships with Fugu.

Page 99: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

b, The percentage amino acid identity distribution of human−chicken 1:1 orthologues broken down by functional categories according to the GO subtree 'biological process'. Genes assigned to multiple categories were counted more than once. Vertical bars indicate the medians of the distributions. c, The percentage amino acid identity distribution of chicken−human 1:1 orthologues broken down by tissue category. Vertical bars indicate the medians of the distributions. Female reproductive tissues include utero-vaginal, ovary and oviduct; immune system tissues include spleen, thymus, caecal tonsil and bursa of Fabricius; and alimentary tissues include gizzard, stomach, and large and small intestines.

Page 101: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Loss, innovation, expansions and contractions of protein families: domain counts and orthologous relations.

Page 102: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

All at-least-twofold over- and under-representations (separated by a solidus) are shown for both members of domain families (a) and 'many to many' orthology relations (b). Ranking of families and groups has been done with respect to the human genome; Fugu data are also shown for comparison. An asterisk indicates that manual analyses refined what were otherwise automatic counts. Families not subjected to twofold variations are not shown.

Page 103: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Approximate history of animal genomes

Page 104: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

The gene content of ancestral animals—as estimated by using automatically delineated orthologous groups (see Methods)—assuming parsimony as well as a lack of horizontal gene transfer among animals. The inferred number of lost (and gained) genes on each lineage is shown as a half-circle, the area of which is proportional to the number. The shadings under the trees give a rough indication of the number of genes per genome. Wherever necessary, the Arabidopsis genome was used as an outgroup to infer the direction of changes. Ancestral estimates close to the root are likely to be underestimates because of unrecognized orthology relations and putative ancestral genes failing to survive in any of the extant genomes. The gene count of present-day genomes was approximated by considering only genes with orthology support; any remaining genes were considered only if they had substantial similarity support within the genome (to avoid spurious gene predictions, pseudogenes and/or fragments; see Methods).

Page 105: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relationships between chromosome sequence length and characteristics for chromosomes 1−28.

Page 106: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

a−f, Recombination rate (a), G+C content (b), densities of genes and CpG islands (c), total lengths of introns and coding exons per gene (d), intergenic gap lengths (e) and densities of interspersed repeat elements (f).

All plots exclude GGA16 and GGA22, which have insufficient sequence, and panel a also excludes GGA23 and GGA25, which have insufficient genetic markers.

Red, macrochromosomes; black, intermediate chromosomes; green, microchromosomes; additional paler colours indicate genes in c and exons in d.

Page 107: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Putative mammalian ancestor recovered by GRIMM and MGR using the human, mouse, rat (not shown) and chicken genomes.

Page 108: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

a, Each genome is represented as an arrangement of 586 synteny blocks each drawn as one unit, regardless of its length in nucleotides. Each human chromosome is assigned a unique colour, and a diagonal line is drawn through the whole chromosome. In other genomes, this diagonal line indicates the relative order and orientation of the rearranged blocks. b, The recovered ancestral X chromosome is optimal and unique. Gene order of the ancestral X chromosome is identical to human. Numbers associated with the lines indicate the minimum number of rearrangements required to convert between two nodes.

Page 109: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

80 Million Years Ago: the mouse Mus musculus

M. musculus was the second mammal to have its genome sequenced. Mouse diverged from human 75 MYA.
Distinguishing features: only 300 of 30,000 annotated genes have no human orthologs
Genome size: 2.5 Gb (euchromatic portion) (cf. 2.9 Gb human)
Chromosomes: 20 (19 autosomes plus X and Y)
Genes: about 30,000
Website: http://www.informatics.jax.org

--Dozens of mouse-specific expansions occurred, such as the olfactory receptor gene family.
--40% of the mouse genome can be aligned to the human genome at the nucleotide level.

Page 774

Page 110: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

5 to 50 Million Years Ago: primate genomes

The phylogenetic tree shows that chimpanzee (Pan troglodytes) and bonobo (pygmy chimpanzee, Pan paniscus) are the two species most closely related to humans. These three species diverged from a common ancestor about 5.4 million years ago, based on an analysis of 36 nuclear genes.

The chimpanzee genome has been sequenced. Other primate genomes (with draft sequences) include:
• rhesus macaque monkey (Macaca mulatta)
• olive baboon (Papio hamadryas anubis)
• gorilla (Gorilla gorilla)
• white-tufted-ear marmoset (Callithrix jacchus)
• ring-tailed lemur (Lemur catta)

Page 778

Page 111: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Page 779

Page 112: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Pan troglodytes

Page 113: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Pan troglodytes (chimpanzee)

• The chimp, bonobo (Pan paniscus, the pygmy chimpanzee) and human all diverged from a common ancestor ~5 to 7 MYA. Humans are distinguished by bipedalism, an enlarged brain, complex tool use, and complex language.

• For the chimp-human comparison, nearly all the nucleotides align. “The focus thus turns to differences rather than similarities.”

• Source: The Chimpanzee Sequencing and Analysis Consortium, Nature (2005) 437:69

Page 778

Page 114: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

* Single-nucleotide substitutions occur at a mean rate of 1.23% between copies of the human and chimpanzee genome, with 1.06% or less corresponding to fixed divergence between the species.
* Regional variation in nucleotide substitution rates is conserved between the hominid and murid genomes, but rates in subtelomeric regions are disproportionately elevated in the hominids.
* Substitutions at CpG dinucleotides, which constitute one-quarter of all observed substitutions, occur at more similar rates in male and female germ lines than non-CpG substitutions.
* Insertion and deletion (indel) events are fewer in number than single-nucleotide substitutions, but result in approximately 1.5% of the euchromatic sequence in each species being lineage-specific.
* There are notable differences in the rate of transposable element insertions: short interspersed elements (SINEs) have been threefold more active in humans, whereas chimpanzees have acquired two new families of retroviral elements.

Page 115: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

* Orthologous proteins in human and chimpanzee are extremely similar, with approximately 29% being identical and the typical orthologue differing by only two amino acids, one per lineage.
* The normalized rates of amino-acid-altering substitutions in the hominid lineages are elevated relative to the murid lineages, but close to that seen for common human polymorphisms, implying that positive selection during hominid evolution accounts for a smaller fraction of protein divergence than suggested in some previous reports.
* The substitution rate at silent sites in exons is lower than the rate at nearby intronic sites, consistent with weak purifying selection on silent sites in mammals.
* Analysis of the pattern of human diversity relative to hominid divergence identifies several loci as potential candidates for strong selective sweeps in recent human history.

Page 123: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

a, Distribution of divergence of the autosomes (blue), the X chromosome (red) and the Y chromosome (green). b, Distribution of variation by chromosome, shown as a box plot. The edges of the box correspond to quartiles; the notches to the standard error of the median; and the vertical bars to the range. The X and Y chromosomes are clear outliers, but there is also high local variation within each of the autosomes.

Page 124: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Human−chimpanzee divergence (blue), G + C content (green) and human recombination rates (red) in sliding 1-Mb windows for human and chimpanzee chromosome 1. Divergence and G + C content are noticeably elevated near the 1p telomere, a trend that holds for most subtelomeric regions (see text). Internally on the chromosome, regions of low G + C content and high divergence often correspond to the dark G bands.
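
As a rough illustration of how a windowed divergence track like the blue curve could be computed, the sketch below counts mismatches per window in a pairwise alignment. It is not the consortium's pipeline: the function name `window_divergence`, the gap handling, and the toy sequences are assumptions made for the example, and the real figure uses 1-Mb windows on whole-chromosome alignments.

```python
def window_divergence(seq_a, seq_b, window, step):
    """Fraction of mismatched aligned bases in each sliding window.

    seq_a and seq_b are two rows of a pairwise alignment (equal length,
    '-' for gaps). Gapped columns are skipped, so only substitutions are
    counted. Windows with no ungapped columns yield None.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    values = []
    for start in range(0, len(seq_a) - window + 1, step):
        columns = [
            (x, y)
            for x, y in zip(seq_a[start:start + window], seq_b[start:start + window])
            if x != "-" and y != "-"
        ]
        if not columns:
            values.append(None)
            continue
        mismatches = sum(1 for x, y in columns if x != y)
        values.append(mismatches / len(columns))
    return values


# Toy alignment with 4-column windows moved 4 columns at a time.
print(window_divergence("ACGTACGT", "ACGAACGT", window=4, step=4))
```

Setting `step` smaller than `window` gives overlapping (sliding) windows; setting them equal gives adjacent, non-overlapping bins.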

Page 125: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

FIGURE 5. Length distribution of small indel events, as determined using bounded sequence gaps.

Sequences present in chimpanzee but not in human (blue) or present in human but not in chimpanzee (red) are shown. The prominent spike around 300 nucleotides corresponds to SINE insertion events. Most of the indels are smaller than 20 bp, but larger indels account for the bulk of lineage-specific sequence in the two genomes.

Page 126: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Human−chimpanzee−mouse−rat tree with branch-specific KA/KS (ω) values.

a, Evolutionary tree. The branch lengths are proportional to the absolute rates of amino acid divergence. b, Maximum-likelihood estimates of the rates of evolution in protein-coding genes for humans, chimpanzees, mice and rats. In the text, ω_hominid is the KA/KS of the combined human and chimpanzee branches and ω_murid of the combined mouse and rat branches. The slight difference between ω_human and ω_chimpanzee is not statistically significant; masking of some heterozygous bases in the chimpanzee sequence may contribute to the observed difference (see Supplementary Information 'Gene evolution').

Page 128: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Perspective and pitfalls

One of the broadest goals of biology is to understand the nature of each species: what are its mechanisms of development, metabolism, homeostasis, reproduction, and behavior? Sequencing a genome does not answer these questions directly. After genome annotation, we try to interpret the function of the genome's constituents in the context of various physiological processes.

The field of bioinformatics needs continued development of algorithms to find genes, repetitive sequences, genome duplications and other features, as well as tools to identify conserved regions. We may then generate and test hypotheses about genome function.

Page 781

Page 130: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relative rate test

Question: what is the relative rate of evolution in lineages A and B? For example, has the rate of evolution in the human lineage accelerated since our divergence from chimpanzees? Have mice accelerated relative to rats? Can we identify a subset of genes/proteins that have undergone accelerated or decelerated evolution?

[Figure: tree showing the two lineages A and B]

Source: Li and Graur (1991)

Page 131: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relative rate test

Consider a closely related outgroup C. (For human/chimpanzee, choose a monkey; for rat/mouse, choose guinea pig.) Recent sequencing of various genomes allows new opportunities to select appropriate outgroups.

[Figure: tree showing lineages A and B with outgroup C]

Page 132: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relative rate test

Let KAC = number of substitutions that occurred between species A and C.
KAC, KAB, KBC can all be measured directly.

[Figure: tree in which lineages A and B share the common ancestor O, with C as the outgroup]

KAC = KOA + KOC
KBC = KOB + KOC
KAB = KOA + KOB

Next solve these (recall that KAC, KAB, KBC are measured).
Goal: find KOA, KOB, KOC to estimate relative rates.

Page 133: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relative rate test

[Figure: tree in which lineages A and B share the common ancestor O, with C as the outgroup]

KAC = KOA + KOC
KBC = KOB + KOC
KAB = KOA + KOB

Next solve these (recall that KAC, KAB, KBC are measured).
Goal: find KOA, KOB, KOC to estimate relative rates.

KOA = (KAC + KAB - KBC) / 2
KOB = (KAB + KBC - KAC) / 2
KOC = (KAC + KBC - KAB) / 2

Finally, measure these and compare KOA versus KOB. This will estimate the relative rates of evolution.
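
The three formulas can be checked with a short Python sketch. The function name `relative_rate` and the input distances are hypothetical values chosen for illustration; they are not taken from the slides or from Li and Graur (1991).

```python
def relative_rate(k_ab, k_ac, k_bc):
    """Solve for branch lengths KOA, KOB, KOC from pairwise distances.

    k_ab, k_ac, k_bc are the measured substitution counts (or distances)
    between species A and B, A and C, and B and C, where C is the outgroup
    and O is the common ancestor of A and B.
    """
    k_oa = (k_ac + k_ab - k_bc) / 2
    k_ob = (k_ab + k_bc - k_ac) / 2
    k_oc = (k_ac + k_bc - k_ab) / 2
    return k_oa, k_ob, k_oc


# Hypothetical distances, for illustration only.
k_oa, k_ob, k_oc = relative_rate(k_ab=8.0, k_ac=21.0, k_bc=19.0)
print(k_oa, k_ob, k_oc)  # 5.0 3.0 16.0
print(k_oa - k_ob)       # 2.0, the same as KAC - KBC
```

Because KOA - KOB = KAC - KBC, the comparison of the two lineages can be read directly from the difference KAC - KBC: a value near zero indicates similar rates in A and B.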

Page 134: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relative rate test

[Figure: tree in which lineages A and B share the common ancestor O, with C as the outgroup]

For mouse (species A) and rat (species B), the rates are nearly equal.

Sequence            #sites   KAB   KAC - KBC
Apolipoprotein E       201   7.4   1.8
Total                 2187   8.9   0.4 + 1.5

Page 135: Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010

Relative rate test

[Figure: tree in which lineages A and B share the common ancestor O, with C as the outgroup]

In humans (species A), the rate is lower than in Old World monkeys (species B).

Sequence                       #sites   KAB   KAC - KBC
globin pseudogene                2000   7.4   2.1
β globin (synonymous sites)        71   8.9   2.8
