8/7/2019 Lect. 7 & 8
http://slidepdf.com/reader/full/lect-7-8 1/34
Lecture 7
Sequencing and assembling genomes
WHAT GENOME SEQUENCES CAN TELL US
1) All hereditary information is encoded in the sequence of bases in DNA.
2) Genomes vary from 600 Kb to > 3,000 Mb but can all be sequenced.
3) Genes are units of transcription. Almost all code for proteins.
4) Since there is a universal 3-letter translation code, the amino acid
sequence of a protein can be determined from the nucleotide sequence
of the gene. Comparison to other proteins gives useful hints to function.
This is where you need to understand protein evolution.
5) Knowing which proteins are encoded in the genome of an organism
helps us understand what it can and cannot do.
6) But it does NOT tell us what it does.
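Point 4 can be illustrated with a minimal Python sketch of translation under the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative assumptions, not part of the lecture, and the short input sequence is made up for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTCTAATAAATAA"))  # MSNK
```

The sketch assumes an uninterrupted coding sequence; a eukaryotic gene would first need its introns removed.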
The human genome has about 3,000 Mb carried on
22 chromosomes plus an X and a Y.
It has been completely sequenced and annotated.
How is it done?
DNA is just a string of 4 bases - A,T,G,C - but a very long string!
Bob Waterston John Sulston Craig Venter
Genomic sequencing is an industrial, high-throughput process
(not to be carried out in an academic laboratory - Craig Venter)
Shot-gun sequencing is the way to go.
Sequence a lot of short fragments and assemble them on the basis of overlap.
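The idea of assembling on the basis of overlap can be sketched as a toy greedy merger in Python. This is only an illustration under simplifying assumptions (error-free reads, one strand, no repeats); it is not how production assemblers work, and the function names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads that shares the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGACAGAAC", "AGAACCTGTT", "CTGTTGCTGC"]))
```

With these three made-up reads the sketch returns a single contig; reads that share no overlap of at least `min_len` bases stay as separate contigs.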
DETERMINING A SEQUENCE
A low proportion of dideoxynucleotide
triphosphates terminates the copies made
by DNA polymerase.
ddATP terminates where an A is incorporated,
ddGTP terminates where a G is incorporated,
etc.
The fragments are separated by length
and the 4 bases are read on the basis
of the dye they incorporate.
Gel electrophoresis can separate
a fragment of 500 from one of 501 bases,
but seldom can separate a fragment of
800 bases from one of 801.
So “reads” are usually only 500 bp long.
It takes a thousand reads to sequence a
fragment of 100 Kb because you need
at least 5X redundancy. It takes millions
of reads to cover a 100 Mb genome.
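The arithmetic behind these read counts can be checked with a short Python sketch. The function name `reads_needed` and its default arguments are assumptions chosen to match the figures on this slide (500 bp reads, 5X redundancy).

```python
import math

def reads_needed(target_bp, read_bp=500, redundancy=5):
    """Number of reads so that total sequenced bases = redundancy x target length."""
    return math.ceil(target_bp * redundancy / read_bp)

print(reads_needed(100_000))      # 100 Kb fragment -> 1000 reads
print(reads_needed(100_000_000))  # 100 Mb genome   -> 1000000 reads
```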
Redundancy improves accuracy and generates overlaps
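One way to see how redundancy improves accuracy is a majority vote at each position of already-aligned reads. The Python sketch below assumes the reads have been aligned beforehand and padded with "-" where they do not cover a position; the function name `consensus` and the example reads are made up for illustration.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority base at each column; '-' marks positions a read does not cover."""
    out = []
    for column in zip(*aligned_reads):
        calls = Counter(base for base in column if base != "-")
        out.append(calls.most_common(1)[0][0] if calls else "N")
    return "".join(out)

reads = [
    "ATGACAG---",
    "--GACTGAAC",  # one miscalled base (T instead of A)
    "ATGACAGAAC",
]
print(consensus(reads))  # ATGACAGAAC
```

The single miscall is outvoted because two other reads cover the same position. With only two reads, a disagreement cannot be resolved this way.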
reads (500 bp)
contigs (5 Kb)
metacontigs (50 Kb)
BACs (200 Kb)
markers
chromosomes
Paired reads and BAC end sequencing establish overlap and gaps
Assembly on the basis of sequence identity in region of overlap (PHRAP)
Shear DNA into fragments of ~2 Kb.
Ligate into a plasmid.
Transform E. coli with plasmid.
Pick thousands of individual clones ROBOTICALLY.
Store in 96-well plates.
Since random inserts are sequenced,
statistically 7 fold redundancy is needed to cover >99.9% of the DNA.
Use cost-effective universal primers that recognize
flanking plasmid sequences to generate dye-marked fragments terminated by
incorporation of dideoxy-nucleotide.
Separate fragments electrophoretically
in an automatic sequencer. The manufacturer’s
computer program then calls the bases.
A program such as PHRED can establish
confidence levels.
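The 7 fold figure quoted on this slide is consistent with a simple Poisson model of random sampling, in which the fraction of bases missed at redundancy c is e^(-c). The Python sketch below assumes that model (reads placed uniformly at random, end effects ignored); the function name `fraction_covered` is an illustrative assumption.

```python
import math

def fraction_covered(redundancy):
    """Expected fraction of bases hit by at least one read under a Poisson model."""
    return 1.0 - math.exp(-redundancy)

for c in (1, 3, 5, 7):
    print(f"{c}X redundancy covers {100 * fraction_covered(c):.2f}% of the DNA")
```

Under this model 5X redundancy still leaves more than 0.1% of the bases unsequenced, while 7X brings coverage above 99.9%.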
ASSEMBLY AND FINISHING
Paired reads can establish
gap size if the average insert length
of cloned fragments is known.
Clones can then be sequenced
to fill the gaps.
End sequencing of large insert
clones carried in BACs or YACs
can generate metacontigs.
Shot-gun sequencing of a 200 Kb
insert that covers a gap can fill the
gap.
PCR fragments up to several Kb
covering the gaps can be
generated and sequenced.
Errors in assembly can be recognized
when PCR fails to generate
fragments of expected size.
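How paired reads establish gap size can be sketched as simple subtraction: the part of the insert that is not accounted for on either contig must lie in the gap. The Python below is a hedged illustration; the function name `estimate_gap`, the 2,000 bp average insert and the distances are all made-up values.

```python
from statistics import mean

def estimate_gap(insert_len, dist_to_end_a, dist_from_start_b):
    """Gap between contigs A and B spanned by one clone.

    dist_to_end_a: bases from the outer end of the read on contig A to the end of A.
    dist_from_start_b: bases from the start of contig B to the outer end of the mate read.
    """
    return insert_len - dist_to_end_a - dist_from_start_b

# Three hypothetical clones whose two end reads land on different contigs.
pairs = [(700, 900), (1200, 450), (300, 1250)]
estimates = [estimate_gap(2000, a, b) for a, b in pairs]
print(estimates, mean(estimates))
```

Because only the average insert length is known, individual estimates scatter; averaging over several spanning clones gives a more stable gap size.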
This shot-gun approach works well for segments up to 50 Kb
but is more problematic for large insert clones with >200 Kb
because incorrect assembly can result from low-information
regions and repetitive elements.
There are various programs that attack the assembly problem
such as PHRAP and EULER. They all benefit from having
>7 fold depth of coverage to reduce errors in the
finished sequence. Error-free sequences can often be
uniquely assembled but can benefit from independent
mapping data.
At the chromosome level, assembly is a mapping problem.
Sequencing of model organisms with small genomes led the way.
Organism        Genome Size    Number of Genes
bacteria        1 to 5 Mb      1,000 to 3,000
yeast           12 Mb          6,000
C. elegans      100 Mb         18,000
Drosophila      120 Mb         14,000
Dictyostelium   34 Mb          12,000
human           3,000 Mb       25,000
Physical (sequence based) maps were generated in different
ways in each of these organisms.
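Dividing the gene counts by the genome sizes in this table gives a rough gene density for each organism. The Python sketch below uses only the table's figures (bacteria are left out because the table gives ranges); the names `GENOMES` and `genes_per_mb` are illustrative.

```python
# (genome size in Mb, number of genes), taken from the table above
GENOMES = {
    "yeast": (12, 6_000),
    "C. elegans": (100, 18_000),
    "Drosophila": (120, 14_000),
    "Dictyostelium": (34, 12_000),
    "human": (3_000, 25_000),
}

def genes_per_mb(organism):
    size_mb, genes = GENOMES[organism]
    return genes / size_mb

for name in GENOMES:
    print(f"{name:14s} {genes_per_mb(name):7.1f} genes per Mb")
```

The human genome comes out at fewer than 10 genes per Mb, against hundreds per Mb for yeast and Dictyostelium.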
[Figure: map of numbered large-insert clones tiled along a chromosome]
Large insert YACs were screened for physically
mapped markers and a tiling set chosen that covers
each chromosome.
Overlapping YACs formed scaffolds for assembly of
sequence-based contigs and confirmed the legacy map.
Position and order were verified by HAPPY mapping.
Markers: DIRS, dhkA, vatM, manA, gluA, rasD, myoM, pabA, vsgB
C6
Summary of genome sequencing methodology
1) Sequence 500 bp from each end of fragments cloned
in a plasmid using primers that start within the plasmid
sequence.
2) Assemble contigs on the basis of sequence overlap.
3) Use paired-reads to recognize gaps.
4) End sequence large inserts (~200 kb) carried in BACs
or YACs to generate a scaffold.
5) Position scaffolds on chromosomes using physically
mapped markers. Fill the gaps.
6) Declare the sequence done!
GENES
The sequence of DNA (the genotype) is replicated at each cell division,
but it is the phenotype that matters.
Genes make proteins and the phenotype is determined by the proteins
that accumulate in different cell types.
But only 2% of human DNA encodes proteins. Genes are hard to find.
The Proteome is the complete repertoire of proteins that a species
can make.
Genes can be recognized in the DNA sequence on the basis of coding
potential.
Genes can also be recognized from their transcribed mRNA sequences.
A higher percentage of the DNA encodes proteins in organisms
with smaller genomes.
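The simplest signal of coding potential is an open reading frame: an ATG followed in the same frame by a run of codons and then a stop codon. The Python sketch below scans the three forward frames only and ignores introns and the reverse strand, so it is a deliberately simplified assumption rather than a real gene finder; `find_orfs` is a hypothetical name.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs, end exclusive, for ATG...stop runs in the 3 forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

dna = "CCATGTCTAATAAATAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Raising `min_codons` filters out the short chance ORFs that occur frequently in any long sequence.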
>JC1b01d12.r1 contig
AAAAAAAAAA AAACGATATT TGTTAAATTT CAACTTTCAA ATAATGACAG AACCTGTTGC
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxUUUUUU UUUCCCCCCC CCCCCCCCCC
TGCACCAAAA AAAAAGATTG TTTTAAAGAG AGCAGCTGGT AGTAGTTCAT CCAATGAATT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC
TAAAATTGAA TCAATTGATA AAACTTTTGG TAATTATTAT TATTATTATT TAAATTTATT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCC1 1111111111 1111111111 1111111111
ATTTAAAAGA AAAAATAAAA ATGTTTAACT TTTTTTTTTT TTTTTTTTTA GAATTACCAA
1111111111 1111111111 1111111111 1111111111 1111111111 1CCCCCCCCC
ATCATTTAAA AAAAGTAAAT GAAAATTTTA ATAATAAATC AAGTACAATT TATAATGTAT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC
ATGAAAAACA AGCAACTGAT ATATTTACAA ATTGGATAAA AGAAAAAAGA TATATCTTAG
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC
ATGTTTGGTC TTAAGAATAA AAAAATAAAA AATACAAATA TGAATAATAA AATAAAAATG
CCCCCCCCCC CCCCuuuuuu uuuuuuuuuu uuuuuuuuuu uuuuuuuuuu uuuuuuxxxx
GCTTTATTTA ATTATTTTAA ATTTAATTTT CCCATTTGTT TTTGTAATTT CTTTTCTTCC
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
TTTTGGGCCG TTTTTTAATT TTTTTTTTTT TTTGTGATTT TTAATTTAAA AAAAAAAAAA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
AAAAAAATAA ATAAATAAAA AAAAAAGAAT GTTTAGAATA ACAAAATTTA ATAAATATTA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
TAATAAATTT AGGTCATTTA AAAGAAAAAA TATAATTTCC ATA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxx
numgf 19 >JC1b01d12.r1 pORF 272--1295 strand f start y stop y
MSNKVGNSKNNKNKSIKFAPKHKDKSYDNEDFNAVSKKSSISVSDLPTKGEEKHRIMALS
FPIKLSMWDFGQCDSKKCTGRKLERLGYVKSINLTHKFKGIVLTPSAKQSISPADRDIVQ
NLGVSVVDCSWAKVDSIPFGKMKGGHDRLLPFLIAANPVNYGKPFKLTCVEAVAACLFIT
GFTAEGHQVLGGFKWGSSFYKVNKDLFEKYILCANSQEVVQIQNEFIAKCEQDQKDRAAN
IEYDEFGLQLNPNRILRTNNDDDEENGDEDYCDDDDEDEDEEDEEEDHECDSECDHDEEE
EEDNDE
HMMgene Prediction
of ORFs
start of ORF (ATG)
splice site
splice site
translation termination [end of ORF]
Classification of chromosome 2 encoded proteins
using GO terminology
Process
Function
Summary of methodology for recognizing genes
Experimental
Sequence cDNAs from a large number of mRNAs and compare to genomic sequence.
Computational
Train an HMM program to recognize start sites, exons, splice sites,
introns, and termination sites of ORFs. Predict genes.
Compare predicted proteins to legacy protein sequences.
Assign likely function to proteins.
Experimental
In situ hybridization to determine cell type expression.
Molecular genetics (knock outs etc) to determine function.
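The computational step rests on hidden Markov models. As a rough illustration of the idea only, the Python sketch below labels each base as coding (C) or non-coding (N) with a two-state HMM decoded by the Viterbi algorithm. All probabilities here are invented for the example (they loosely assume AT-rich non-coding DNA), nothing is trained, and real gene finders use many more states for start sites, splice sites and codon position.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
# Invented emission probabilities: non-coding DNA is assumed to be very AT-rich.
EMIT = {
    "N": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
    "C": {"A": 0.30, "T": 0.30, "G": 0.20, "C": 0.20},
}

def viterbi(seq):
    """Most probable state path for seq under the toy model above."""
    if not seq:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    pointers = []
    for base in seq[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # best previous state from which to enter state s
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        score = new_score
        pointers.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(pointers):  # trace back from the last base
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("A" * 10 + "GC" * 5 + "A" * 10))
```

On this made-up input the GC-containing stretch in the middle is labeled C and the flanking A runs are labeled N, because the gain from the better-fitting emissions outweighs the cost of switching state twice.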
Lecture 8
Exon shuffling and Gene Loss
New proteins can arise by incorporating domains from
other proteins. This process is aided by exon shuffling
but exons do not define domains.
The genetic repertoire carried by the common ancestor of plants, animals and fungi may have been larger than
what is found in any of these kingdoms now.
Specialized organisms shed genes.
When you have a whole genome sequenced, missing genes
can be recognized.
Common domains
Many large (> 200 aa) proteins have multiple, partially independent
domains. Some of these domains are found in various different proteins.
When organisms evolved a closed circulatory system about 400 Myrs ago,
there was a strong selection for clotting proteins to fill any accidental leaks.
Factor XII, a protein of 600 amino acids, is one of the clotting factors.
It "borrowed" several previously established domains:
kringle (found in several clotting factors as well as proteases)
kallikrein (another clotting factor domain, also found in peptidases)
EGF (found in many extra-cellular proteins and receptors; often involved in protein-protein interactions)
fibronectin I
Exon shuffling
Exons in the 9 kb Factor XII gene
[Figure: exons 4, 5, 6 and 7 of the Factor XII gene, drawn twice, with two positions marked x]
Since introns are spliced out, their length is not important.
The 5' exon/intron border [x/GT...] may be:
at the end of a codon (phase 0),
include the first base of a codon (phase 1),
or include 2 bases of a codon (phase 2).
Introns must begin and end in the same phase class [AG/yz].
Therefore, an inserted exon must have the same phase group
as the flanking exons. Many inserted exons are class 1.
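The phase rule can be expressed as modular arithmetic: an exon that follows an intron of phase p and is n bases long is followed by an intron of phase (p + n) mod 3. The Python sketch below applies that to decide whether inserting an exon into an intron preserves the reading frame; the function names are illustrative assumptions.

```python
def next_intron_phase(upstream_phase, exon_len):
    """Phase of the intron that follows an exon of exon_len bases."""
    return (upstream_phase + exon_len) % 3

def can_insert(exon_upstream_phase, exon_len, target_intron_phase):
    """True if the exon can be dropped into the target intron without shifting the frame."""
    symmetric = next_intron_phase(exon_upstream_phase, exon_len) == exon_upstream_phase
    return symmetric and exon_upstream_phase == target_intron_phase

print(can_insert(1, 120, 1))  # class 1-1 exon into a phase 1 intron: frame preserved
print(can_insert(1, 120, 0))  # same exon into a phase 0 intron: frame broken
print(can_insert(1, 121, 1))  # exon length not a multiple of 3: frame broken
```

An exon is symmetric (same phase on both sides) exactly when its length is a multiple of 3, which is why only such exons can be inserted, duplicated or deleted without disturbing the downstream reading frame.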
Protein modules
Exon shuffling can not only add a new exon, but can also
duplicate existing exons or delete an exon.
The boundary amino acids encoded by an exon often do
not coincide with the boundaries of domains (contrary to
what some have proposed).
However, if a domain is encoded within an exon, it can
become a mobile module.
Gene loss
Genes can either arise in a specific lineage or be lost in a related lineage.
Evidence for gene loss requires that orthologs be present in "flanking" species
derived from a common ancestor.
1002 genes present in tomato were not found in Arabidopsis. 154 were
clearly present in either soy or Medicago. These are cases of gene loss in
the Arabidopsis lineage.
Some highly conserved genes that are present in both monocots and
dicots have been lost in Arabidopsis.
One of them, slr2032, appears to have come from the Synechocystis-like
genome that gave rise to the chloroplast.
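The reasoning used for the tomato and Arabidopsis comparison can be written as set operations on ortholog presence. In the Python sketch below the gene identifiers (g1 to g6) and the presence sets are entirely hypothetical; only the logic follows the slide: a gene counts as lost in the focal species if it is absent there but present both in a relative and in at least one flanking species.

```python
def lost_in(focal, relative, flanking, genomes):
    """Genes present in `relative` and in some flanking species but absent from `focal`."""
    candidates = genomes[relative] - genomes[focal]
    seen_in_flanking = set().union(*(genomes[s] for s in flanking))
    return candidates & seen_in_flanking

# Hypothetical ortholog presence sets, for illustration only.
genomes = {
    "tomato": {"g1", "g2", "g3", "g4", "g5"},
    "Arabidopsis": {"g1", "g2"},
    "soy": {"g1", "g3"},
    "Medicago": {"g2", "g4", "g6"},
}

print(sorted(lost_in("Arabidopsis", "tomato", ["soy", "Medicago"], genomes)))
```

In this made-up data g5 is found only in tomato, so it is not counted: it could equally be a gene that arose in the tomato lineage.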
History of gene slr2032
[Figure: tree of a blue-green bacterium, a "primitive" alga, a diatom, algae, monocots and Arabidopsis, labeled with the number of genes in the chloroplast genome. Key: slr2032 found in chloroplast genome; slr2032 found in nuclear genome; slr2032 missing from both chloroplast and nuclear genomes.]
[Figure: phylogenetic tree of Leishmania, Arabidopsis, Oryza, Plasmodium, Neurospora, Schizosaccharomyces, Saccharomyces, Homo, Fugu, Caenorhabditis, Drosophila, Anopheles, Ciona and Dictyostelium, with bacteria. Scale bar: 100 Darwins.]
The complement of ancient genes available to the common ancestor of the
crown organisms included genes with orthologs now in early diverging organisms
as well as either plants, fungi or animals. Likewise, genes with orthologs in both
a plant and a fungus or an animal were available to the common ancestor.
2,258 such genes have been recognized.
[Figure: two diagrams comparing Dictyostelium (Dicty), Arabidopsis (Arab.), Drosophila (Dros.) and Saccharomyces (Sacch.)]
Percent of ancient genes that have been lost or highly modified since the plant/animal divergence
Percent of ancient genes that have been retained
Comparison of members of large families of related genes
in diverse organisms can uncover a history of gene loss
and domain loss.
The ABC family is one of the largest in eukaryotic genomes.
Its genes encode half-transporters with one ABC domain and
full-transporters with two ABC domains. Many have
transmembrane domains.
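The half/full distinction comes down to counting ABC domains in a protein's domain annotation. The Python sketch below assumes each protein is described by an ordered list of domain labels, with "ABC" for an ATP-binding cassette and "TM" for a transmembrane domain; the function name and the example architectures are hypothetical.

```python
def classify_transporter(domains):
    """Classify a protein from its ordered list of domain labels."""
    n_abc = domains.count("ABC")
    if n_abc == 1:
        return "half-transporter"
    if n_abc == 2:
        return "full-transporter"
    return "unclassified"

# Hypothetical domain architectures, for illustration only.
print(classify_transporter(["TM", "ABC"]))
print(classify_transporter(["TM", "ABC", "TM", "ABC"]))
print(classify_transporter(["TM"]))
```

Keeping the domains as an ordered list also preserves domain order, which matters when comparing families whose cassette and transmembrane domains are arranged differently.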
Members of the ABC superfamily of transporters all have related
ATP-binding cassettes. There are 8 families.
There are 68 ABC genes in the Dictyostelium genome.
[Figure labels: Full transporters, Half-transporters]
There are 11 ABCA genes in Dictyostelium.
Fungi have no genes of this family.
In animals ABCA proteins all have two transmembrane domains
(humans have 12 such genes).
In plants there is one gene with two domains and 16 with a single domain.
There appear to have been several cases of gene loss affecting whole
kingdoms.
The ABCG family is the only one in which the ABC cassette
precedes the transmembrane domain. The progenitor may have
arisen by fusion of domains or domain loss.
Summary
Exon shuffling can facilitate insertion, deletion or duplication of a protein
domain.
Genes duplicate and diverge, but sometimes both copies
are lost because they are not needed in a new context (new species).
When a whole genome sequence is available, the LACK of a given
gene can be definitive.
Some highly conserved genes, such as slr2032, are missing in Arabidopsis.
What has taken over their functions?
Analyses of proteins that are members of superfamilies can uncover histories
of gene loss. In the ABCA family the last copy of a half-transporter was lost
between the time that Dictyostelium diverged and the time that fungi diverged.
The last copy of a full-transporter was lost in the line leading to fungi.
There are several ways by which domain order can be rearranged.