Lect. 7 & 8

8/7/2019 Lect. 7 & 8

Lecture 7

Sequencing and assembling genomes

WHAT GENOME SEQUENCES CAN TELL US

1) All hereditary information is encoded in the sequence of bases in DNA.

2) Genomes vary from 600 Kb to > 3,000 Mb but can all be sequenced.

3) Genes are units of transcription. Almost all code for proteins.

4) Since there is a universal 3-letter translation code, the amino acid sequence of a protein can be determined from the nucleotide sequence of the gene. Comparison to other proteins gives useful hints to function. This is where you need to understand protein evolution.

5) Knowing which proteins are encoded in the genome of an organism helps us understand what it can and cannot do.

6) But it does NOT tell us what it does.
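Point 4 can be made concrete with a short sketch. The code below is only an illustration, assuming the standard genetic code and a reading frame that starts at the first base; the function name translate is chosen here and is not part of the lecture.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence, assuming the frame starts at index 0."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # a stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, translate("ATGGCCAAATAA") reads the codons ATG, GCC, AAA and stops at TAA.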

The human genome has about 3,000 Mb carried on 22 chromosomes plus an X and a Y. It has been completely sequenced and annotated.

How is it done?

DNA is just a string of 4 bases - A,T,G,C - but a very long string!

Bob Waterston, John Sulston, Craig Venter

Genomic sequencing is an industrial, high-throughput process

(not to be carried out in an academic laboratory - Craig Venter)

Shot-gun sequencing is the way to go.

Sequence a lot of short fragments and assemble them on the basis of overlap.

DETERMINING A SEQUENCE

A low proportion of dideoxynucleotide triphosphates terminates the copies made by DNA polymerase. ddATP terminates where A is coded, ddGTP terminates where G is coded, etc.

The fragments are separated by length and the 4 bases are read on the basis of the dye they incorporate.

Gel electrophoresis can separate a fragment of 500 bases from one of 501, but seldom can separate a fragment of 800 bases from one of 801. So “reads” are usually only 500 bp long.

It takes a thousand reads to sequence a fragment of 100 Kb because you need at least 5X redundancy (100 Kb x 5 / 500 bp). It takes a million or more reads to cover a 100 Mb genome.
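The read counts above follow from simple arithmetic. A minimal sketch, assuming every read yields the full 500 bp; the function name reads_needed is chosen here for illustration.

```python
def reads_needed(target_bp, read_len_bp=500, redundancy=5):
    """Number of reads needed to reach the given redundancy (rounded up)."""
    total_bases = target_bp * redundancy
    return (total_bases + read_len_bp - 1) // read_len_bp  # integer ceiling


print(reads_needed(100_000))       # a 100 Kb fragment at 5X
print(reads_needed(100_000_000))   # a 100 Mb genome at 5X
```

Raising redundancy to 7 or shortening the usable read length increases the count proportionally.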

Redundancy improves accuracy and generates overlaps

reads (500 bp)

contigs (5 Kb)

metacontigs (50 Kb)

BACs (200 Kb)

markers

chromosomes

Paired reads and BAC end sequencing establish overlaps and gaps.

Assembly is based on sequence identity in the region of overlap (PHRAP).
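The idea of assembling on overlap can be shown with a toy example. This is not how PHRAP works internally; it is a deliberately simple greedy merge that assumes error-free reads, exact overlaps and a single strand, with function names invented for this sketch.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences that share the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps any more: the rest stay separate contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs
```

With real data, sequencing errors and repeats make exact greedy merging unreliable, which is why base-quality values and mapping data matter.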

Shear DNA into fragments of ~2 Kb.

Ligate into a plasmid.

Transform E. coli with plasmid.

Pick thousands of individual clones ROBOTICALLY.

Store in 96-well plates.

Since random inserts are sequenced, statistically 7-fold redundancy is needed to cover >99.9% of the DNA.
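The 7-fold figure is consistent with the usual Poisson approximation for random reads, under which a base is missed with probability e^(-c) at redundancy c. The sketch below assumes that model (uniformly random read starts, no cloning bias); the function names are chosen here.

```python
import math


def fraction_covered(redundancy):
    """Expected fraction of bases covered at least once (Poisson model)."""
    return 1.0 - math.exp(-redundancy)


def redundancy_needed(target_fraction):
    """Redundancy at which the expected covered fraction reaches the target."""
    return -math.log(1.0 - target_fraction)


for c in (1, 3, 5, 7):
    print(c, round(fraction_covered(c), 5))
```

Under this model 5X leaves a little under 1% of the DNA uncovered, while 7X leaves less than 0.1%.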

Use cost-effective universal primers that recognize flanking plasmid sequences to generate dye-marked fragments terminated by incorporation of a dideoxynucleotide.

Separate fragments electrophoretically in an automatic sequencer. The manufacturer’s computer program then calls the bases.

A program such as PHRED can establish confidence levels.
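PHRED-style confidence levels are conventionally reported as quality scores Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The conversion can be sketched as follows; the function names are illustrative only.

```python
import math


def error_probability(quality):
    """Probability that a base call is wrong, given its quality score."""
    return 10 ** (-quality / 10)


def quality_score(p_error):
    """Quality score for a given error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

A score of 20 therefore means about 1 error in 100 calls, and 30 about 1 in 1,000.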

ASSEMBLY AND FINISHING

Paired reads can establish gap size if the average insert length of cloned fragments is known. Clones can then be sequenced to fill the gaps.

End sequencing of large insert clones carried in BACs or YACs can generate metacontigs.

Shot-gun sequencing of a 200 Kb insert that covers a gap can fill the gap.

PCR fragments up to several Kb covering the gaps can be generated and sequenced.

Errors in assembly can be recognized when PCR fails to generate fragments of expected size.
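Estimating a gap from paired reads is a subtraction: the part of the insert that is not accounted for inside either contig must lie in the gap. A minimal sketch, assuming the two reads of a clone land in adjacent contigs and that positions are measured as described in the comments; all names and numbers are hypothetical.

```python
def estimate_gap(insert_len, left_contig_len, left_read_start, right_insert_end):
    """Estimate the gap between two contigs bridged by one clone.

    insert_len:       average insert length of the cloned fragments
    left_contig_len:  length of the contig holding the forward read
    left_read_start:  position in the left contig where the insert begins
    right_insert_end: position in the right contig (from its start) where
                      the insert ends, i.e. the outer end of the reverse read
    """
    bases_in_left = left_contig_len - left_read_start
    bases_in_right = right_insert_end
    return insert_len - bases_in_left - bases_in_right


# Hypothetical clone: 2 Kb insert, 800 bp in the left contig, 700 bp in the right.
print(estimate_gap(2000, 5000, 4200, 700))
```

A clearly negative estimate suggests that the contigs overlap or that one of them is misassembled; averaging over several bridging clones reduces the effect of insert-size variation.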

This shot-gun approach works well for segments up to 50 Kb but is more problematic for large insert clones with >200 Kb because incorrect assembly can result from low-information regions and repetitive elements.

There are various programs that attack the assembly problem, such as PHRAP and EULER. They all benefit from having >7-fold depth of coverage to reduce errors in the finished sequence. Error-free sequences can often be uniquely assembled but can benefit from independent mapping data.

At the chromosome level, assembly is a mapping problem.

Sequencing of model organisms with small genomes led the way.

Organism         Genome Size    Number of Genes
bacteria         1 to 5 Mb      1,000 to 3,000
yeast            12 Mb          6,000
C. elegans       100 Mb         18,000
Drosophila       120 Mb         14,000
Dictyostelium    34 Mb          12,000
human            3,000 Mb       25,000

Physical (sequence based) maps were generated in different ways in each of these organisms.
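Dividing the gene counts by the genome sizes in this table shows how much more densely genes are packed in small genomes. The sketch below simply restates the table values (bacteria are omitted because the table gives ranges); the variable names are chosen here.

```python
# (genome size in Mb, number of genes), copied from the table above
GENOMES = {
    "yeast": (12, 6_000),
    "C. elegans": (100, 18_000),
    "Drosophila": (120, 14_000),
    "Dictyostelium": (34, 12_000),
    "human": (3_000, 25_000),
}

# genes per Mb for each organism
gene_density = {name: genes / size_mb for name, (size_mb, genes) in GENOMES.items()}

for name, density in sorted(gene_density.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {density:7.1f} genes per Mb")
```

Yeast comes out at about 500 genes per Mb and human at fewer than 10.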

[Figure: map of overlapping clones, each labeled with a four-digit number (3002 to 4007)]

Large insert YACs were screened for physically mapped markers and a tiling set was chosen that covers each chromosome.

Overlapping YACs formed scaffolds for assembly of sequence-based contigs and confirmed the legacy map. Position and order were verified by HAPPY mapping.

DIRS dhkA vatM manA gluA rasD myoM pabA vsgB

C6
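Choosing a tiling set can be viewed as an interval-covering problem. The sketch below uses a standard greedy rule (always extend with the clone that reaches furthest) and assumes clone end positions on the map are already known; it is not the procedure used for the map above, and the clone names and coordinates are hypothetical.

```python
def tiling_path(clones, start, end):
    """Pick a small set of clones that covers the region from start to end.

    clones: dict mapping clone name -> (left, right) map positions.
    Raises ValueError if some position is not covered by any clone.
    """
    path = []
    covered_to = start
    while covered_to < end:
        # clones that overlap the covered region and extend beyond it
        candidates = [
            (right, name)
            for name, (left, right) in clones.items()
            if left <= covered_to < right
        ]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        best_right, best_name = max(candidates)
        path.append(best_name)
        covered_to = best_right
    return path


# Hypothetical YACs with positions in Kb.
yacs = {"Y1": (0, 300), "Y2": (100, 350), "Y3": (250, 600),
        "Y4": (320, 900), "Y5": (580, 1000)}
print(tiling_path(yacs, 0, 1000))
```

In practice overlaps are inferred from shared markers rather than known coordinates, so this only illustrates the selection step.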

Summary of genome sequencing methodology

1) Sequence 500 bp from each end of fragments cloned in a plasmid using primers that start within the plasmid sequence.

2) Assemble contigs on the basis of sequence overlap.

3) Use paired-reads to recognize gaps.

4) End sequence large inserts (~200 kb) carried in BACs

or YACs to generate a scaffold.

5) Position scaffolds on chromosomes using physically

mapped markers. Fill the gaps.

6) Declare the sequence done!

GENES

The sequence of DNA (the genotype) is replicated at each cell division, but it is the phenotype that matters.

Genes make proteins and the phenotype is determined by the proteins that accumulate in different cell types. But only 2% of human DNA encodes proteins. Genes are hard to find.

The Proteome is the complete repertoire of proteins that a species can make.

Genes can be recognized in the DNA sequence on the basis of coding potential.

Genes can also be recognized from their transcribed mRNA sequences.

A higher percentage of the DNA encodes proteins in organisms with smaller genomes.
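The simplest measure of coding potential is an open reading frame: a start codon followed by a run of codons with no stop. The sketch below scans the three forward frames only and ignores introns, so it is far cruder than a trained gene finder; the function name and the min_codons threshold are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the given strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAACCCGGGTAACC"))
```

In AT-rich genomes stop codons are frequent in non-coding DNA, so long ORFs are a useful, though not sufficient, signal.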

>JC1b01d12.r1 contig

AAAAAAAAAA AAACGATATT TGTTAAATTT CAACTTTCAA ATAATGACAG AACCTGTTGC
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxUUUUUU UUUCCCCCCC CCCCCCCCCC

TGCACCAAAA AAAAAGATTG TTTTAAAGAG AGCAGCTGGT AGTAGTTCAT CCAATGAATT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC

TAAAATTGAA TCAATTGATA AAACTTTTGG TAATTATTAT TATTATTATT TAAATTTATT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCC1 1111111111 1111111111 1111111111

ATTTAAAAGA AAAAATAAAA ATGTTTAACT TTTTTTTTTT TTTTTTTTTA GAATTACCAA
1111111111 1111111111 1111111111 1111111111 1111111111 1CCCCCCCCC

ATCATTTAAA AAAAGTAAAT GAAAATTTTA ATAATAAATC AAGTACAATT TATAATGTAT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC

ATGAAAAACA AGCAACTGAT ATATTTACAA ATTGGATAAA AGAAAAAAGA TATATCTTAG
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC

ATGTTTGGTC TTAAGAATAA AAAAATAAAA AATACAAATA TGAATAATAA AATAAAAATG
CCCCCCCCCC CCCCuuuuuu uuuuuuuuuu uuuuuuuuuu uuuuuuuuuu uuuuuuxxxx

GCTTTATTTA ATTATTTTAA ATTTAATTTT CCCATTTGTT TTTGTAATTT CTTTTCTTCC
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx

TTTTGGGCCG TTTTTTAATT TTTTTTTTTT TTTGTGATTT TTAATTTAAA AAAAAAAAAA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx

AAAAAAATAA ATAAATAAAA AAAAAAGAAT GTTTAGAATA ACAAAATTTA ATAAATATTA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx

TAATAAATTT AGGTCATTTA AAAGAAAAAA TATAATTTCC ATA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxx

numgf 19 >JC1b01d12.r1 pORF 272--1295 strand f start y stop y

MSNKVGNSKNNKNKSIKFAPKHKDKSYDNEDFNAVSKKSSISVSDLPTKGEEKHRIMALS
FPIKLSMWDFGQCDSKKCTGRKLERLGYVKSINLTHKFKGIVLTPSAKQSISPADRDIVQ
NLGVSVVDCSWAKVDSIPFGKMKGGHDRLLPFLIAANPVNYGKPFKLTCVEAVAACLFIT
GFTAEGHQVLGGFKWGSSFYKVNKDLFEKYILCANSQEVVQIQNEFIAKCEQDQKDRAAN
IEYDEFGLQLNPNRILRTNNDDDEENGDEDYCDDDDEDEDEEDEEEDHECDSECDHDEEE
EEDNDE

HMMgene Prediction of ORFs

[Figure labels: start of ORF (ATG); splice site; splice site; translation termination (end of ORF)]

Classification of chromosome 2 encoded proteins using GO terminology

[Figure: two panels, Process and Function]

Summary of methodology for recognizing genes

Experimental: Sequence cDNAs from a large number of mRNAs and compare to genomic sequence.

Computational: Train an HMM program to recognize start sites, exons, splice sites, introns, and termination sites of ORFs. Predict genes.

Compare predicted proteins to legacy protein sequences.

Assign likely function to proteins.

Experimental: In situ hybridization to determine cell type expression.

Molecular genetics (knock outs etc) to determine function.

Lecture 8

Exon shuffling and Gene Loss

New proteins can arise by incorporating domains from other proteins. This process is aided by exon shuffling but exons do not define domains.

The genetic repertoire carried by the common ancestor of plants, animals and fungi may have been larger than what is found in any of these kingdoms now.

Specialized organisms shed genes.

When you have a whole genome sequenced, missing genes can be recognized.

Common domains

Many large (> 200 aa) proteins have multiple, partially independent domains. Some of these domains are found in various different proteins.

When organisms evolved a closed circulatory system about 400 Myr ago, there was a strong selection for clotting proteins to fill any accidental leaks.

Factor XII, a protein of 600 amino acids, is one of the clotting factors. It "borrowed" several previously established domains.

Domain labels in the figure:

EGF

kringle (found in several clotting factors as well as proteases)

kallikrein (another clotting factor domain, also found in peptidases)

EGF (the EGF domain is found in many extracellular proteins and receptors; it is often involved in protein-protein interactions)

fibronectin I

Exon shuffling

Exons in the 9 kb Factor XII gene

[Figure: exons 4, 5, 6 and 7 of the gene, drawn twice, with two positions marked x]

Since introns are spliced out, their length is not important.

The 5' exon/intron border [x/GT...] may be:

at the end of a codon (phase 0)

include the first base of a codon (phase 1)

or include 2 bases of a codon (phase 2).

Introns must begin and end in the same phase class [AG/yz].

Therefore, an inserted exon must have the same phase group as the flanking exons. Many inserted exons are class 1.
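The phase rule can be checked with modular arithmetic. The sketch below assumes the usual convention that an intron's phase is the number of coding bases preceding it, modulo 3 (which matches phases 0, 1 and 2 as defined above), and that exon lengths count coding bases only. The function names are invented for this illustration.

```python
def intron_phases(exon_lengths):
    """Phase of the intron following each exon except the last."""
    phases = []
    coding_bases = 0
    for length in exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


def exon_class(exon_len, upstream_phase):
    """(upstream phase, downstream phase) of an internal exon."""
    return upstream_phase, (upstream_phase + exon_len) % 3


def insertion_keeps_frame(exon_len, exon_upstream_phase, target_intron_phase):
    """True if the exon can be dropped into the intron without a frameshift.

    The exon must be symmetric (same phase on both sides) and its phase
    class must match the phase of the intron that receives it.
    """
    upstream, downstream = exon_class(exon_len, exon_upstream_phase)
    return upstream == downstream == target_intron_phase


print(intron_phases([100, 120, 80]))
print(insertion_keeps_frame(120, 1, 1))
```

A class 1 exon of 120 bases can enter a phase 1 intron, whereas an exon of 121 bases, or one taken from a phase 0 context, would shift or misread the frame.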

Protein modules

Exon shuffling can not only add a new exon, but can also duplicate existing exons or delete an exon.

The boundary amino acids encoded by an exon often do not coincide with the boundaries of domains (contrary to what some have proposed).

However, if a domain is encoded within an exon, it can become a mobile module.

Gene loss

Genes can either arise in a specific lineage or be lost in a related lineage. Evidence for gene loss requires that orthologs be present in "flanking" species derived from a common ancestor.

1002 genes present in tomato were not found in Arabidopsis. Of these, 154 were clearly present in either soy or Medicago. These are cases of gene loss in the Arabidopsis lineage.

Some highly conserved genes that are present in both monocots and dicots have been lost in Arabidopsis. One of them, slr2032, appears to have come from the Synechocystis-like genome that gave rise to the chloroplast.
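The inference of gene loss is a set comparison: a gene that is missing from the target genome but present in a reference species and in at least one flanking species was most likely lost. A minimal sketch, assuming ortholog assignments are already available as sets of gene-family identifiers; the identifiers and the function name are hypothetical.

```python
def inferred_losses(target, reference, flanking):
    """Genes absent from `target` but present in `reference` and in at
    least one of the `flanking` genomes (each genome is a set of IDs)."""
    missing_from_target = reference - target
    in_any_flanking = set().union(*flanking)
    return missing_from_target & in_any_flanking


# Hypothetical ortholog groups.
arabidopsis = {"g1", "g2"}
tomato = {"g1", "g2", "g3", "g4"}
soy = {"g3"}
medicago = {"g1"}

print(inferred_losses(arabidopsis, tomato, [soy, medicago]))
```

Here g3 counts as a loss in the target lineage, while g4 does not, because without a flanking ortholog it could equally have arisen in the reference lineage.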

History of gene slr2032

[Figure: history of gene slr2032. Organisms shown: a blue-green bacterium, a "primitive" alga, a diatom, algae, monocots, Arabidopsis. Axis: number of genes in chloroplast genome. Legend: slr2032 found in chloroplast genome; slr2032 found in nuclear genome; slr2032 missing from both chloroplast and nuclear genomes.]

[Figure: tree of sequenced organisms: Leishmania, Arabidopsis, Oryza, Plasmodium, Neurospora, Schizosaccharomyces, Saccharomyces, Homo, Fugu, Caenorhabditis, Drosophila, Anopheles, Ciona, Dictyostelium, bacteria. Scale bar: 100 Darwins.]

The complement of ancient genes available to the common ancestor of the crown organisms included genes with orthologs now in early diverging organisms as well as either plants, fungi or animals. Likewise, genes with orthologs in both a plant and a fungus or an animal were available to the common ancestor.

2,258 such genes have been recognized.

[Figure, two panels with sectors labeled Dicty, Arab., Dros. and Sacch.: "Percent of ancient genes that have been lost or highly modified since the plant/animal divergence" and "Percent of ancient genes that have been retained". The individual percentages are not recoverable from the extracted text.]

Comparison of members of large families of related genes in diverse organisms can uncover a history of gene loss and domain loss.

The ABC family is one of the largest in eukaryotic genomes. Its genes encode half-transporters with one ABC domain and full-transporters with two ABC domains. Many have transmembrane domains.

Members of the ABC superfamily of transporters all have related ATP-binding cassettes. There are 8 families.

There are 68 ABC genes in the Dictyostelium genome.

[Figure labels: Full transporters; Half-transporters]

There are 11 ABCA genes in Dictyostelium.

Fungi have no genes of this family.

In animals ABCA proteins all have two transmembrane domains (humans have 12 such genes).

In plants there is one gene with two domains and 16 with a single domain.

There appear to have been several cases of gene loss affecting whole kingdoms.

The ABCG family is the only one in which the ABC cassette precedes the transmembrane domain. The progenitor may have arisen by fusion of domains or domain loss.

Summary

Exon shuffling can facilitate insertion, deletion or duplication of a protein domain.

Genes duplicate and diverge, but sometimes both copies are lost because they are not needed in a new context (new species).

When a whole genome sequence is available, the LACK of a given gene can be definitive.

Some highly conserved genes, such as slr2032, are missing in Arabidopsis. What has taken over their functions?

Analyses of proteins that are members of superfamilies can uncover histories of gene loss. In the ABCA family the last copy of a half-transporter was lost between the time that Dictyostelium diverged and the time that fungi diverged. The last copy of a full-transporter was lost in the line leading to fungi.

There are several ways by which domain order can be rearranged.
