Lecture 10: Whole genome
sequencing and analysis
Introduction to Computational
Teresa Przytycka, PhD
• Goal – obtain the string of bases that make a given
• Problem – Typically one can directly sequence
only short stretches of DNA (400-700 bp – Sanger;
Cutting and breaking DNA
• Restriction enzymes – proteins that catalyze hydrolysis
(breaking the molecule by adding water) of DNA at
certain points called restriction sites.
• Example: EcoRI restriction site GAATTC. Note that the
reverse complement of GAATTC is GAATTC (a sequence equal
to its reverse complement is called a palindrome)
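The palindrome property of a restriction site can be checked mechanically. The following is a minimal sketch; the function names are illustrative and do not come from the lecture, and it assumes sequences over the alphabet A, C, G, T only.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_restriction_palindrome(site: str) -> bool:
    """A site is a palindrome (in the DNA sense) if it equals its reverse complement."""
    return site == reverse_complement(site)


def find_sites(dna: str, site: str) -> list[int]:
    """Return the 0-based start positions of every occurrence of site in dna."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


if __name__ == "__main__":
    print(reverse_complement("GAATTC"))
    print(is_restriction_palindrome("GAATTC"))
    print(find_sites("TTGAATTCAAGAATTCG", "GAATTC"))
```

Because the site equals its own reverse complement, scanning one strand is enough to locate it on both strands.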
• After DNA fragments (reads) are sequenced
we want to assemble them together to
reconstruct the entire target sequence.
• If the overlaps were unique and error free,
this would be a relatively easy task… but they
• In addition: fragments can come from either of the
two DNA strands and we do not know which
The “ideal” example
Assume a target sequence of about 10 bp.
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC consensus sequence
• Most fragment assembly algorithms include the
following 3 steps:
– Overlap - finding potentially overlapping fragments
– Layout – finding the order of the fragments
– Consensus – deriving DNA sequence from the layout.
• Usually we know with some approximation the
length of the target sequence.
• In theory we should test all pairs of fragments
for overlaps. For every pair we will consider all
• One possible method: perform alignment without
charging for flanking gaps
--TAATG
TGTAA--
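An alignment that does not charge for flanking gaps can be sketched with a small dynamic program. This is one possible formulation, not necessarily the one used in the lecture: the first row and column are initialized to zero (leading gaps are free) and the best score is read from the last row or last column (trailing gaps are free). The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def overlap_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of a and b when flanking gaps cost nothing."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning a prefix of a ending at i with a
    # prefix of b ending at j; row 0 and column 0 stay 0 (free leading gaps).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap inserted in b
            left = score[i][j - 1] + gap    # gap inserted in a
            score[i][j] = max(diag, up, left)
    # Free trailing gaps: the alignment may end anywhere on the last row or column.
    best_last_row = max(score[rows - 1])
    best_last_col = max(row[cols - 1] for row in score)
    return max(best_last_row, best_last_col)


if __name__ == "__main__":
    print(overlap_score("TGTAA", "TAATG"))
```

For the pair shown above, the best alignment matches the suffix TAA of TGTAA against the prefix TAA of TAATG. Allowing mismatches and internal gaps is what makes this approach tolerant of sequencing errors.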
F – the set of fragments. Overlap graph:
vertices = elements of F
weighted edges: if a, b ∈ F then the weight of the
edge from a to b is equal to t, where t is the maximum
integer such that
suffix(a,t) = prefix(b,t)
suffix(a,t) = last t symbols of a
prefix(b,t) = first t symbols of b
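The overlap graph definition translates directly into code. The sketch below assumes exact (error-free) overlaps and a known fragment orientation. Storing only edges of positive weight is a simplification made here, not something the definition requires, and the function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Largest t such that the last t symbols of a equal the first t symbols of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def overlap_graph(fragments: list[str]) -> dict[tuple[str, str], int]:
    """Map each ordered pair (a, b) of distinct fragments to its overlap weight.

    Only edges with positive weight are kept (an assumption of this sketch).
    """
    graph = {}
    for a in fragments:
        for b in fragments:
            if a != b:
                t = overlap(a, b)
                if t > 0:
                    graph[(a, b)] = t
    return graph


if __name__ == "__main__":
    for edge, weight in overlap_graph(["TGTAA", "TAATG"]).items():
        print(edge, weight)
```

Note that the graph is directed: the weight from a to b generally differs from the weight from b to a.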
Path dbc leads to alignment
Path abcd leads to alignment
Each simple path (simple = not using the
same vertex more than once) in overlap
graph defines an alignment.
- no fragment completely included in
- Direction of fragments is known
Definition: Hamiltonian path – a path that
visits each vertex exactly once.
Let P be a path and A the set of fragments
involved in P. Then
|S(P)| = ||A|| - w(P)
where S(P) is the sequence spelled by the alignment that P defines,
||A|| is the sum of the lengths of the fragments in A, and
w(P) is the weight of path P (the sum of
the edge weights on this path).
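The length formula can be verified on a small case. The sketch below spells the sequence for a path by appending, for each consecutive pair, the part of the next fragment that extends past the overlap. It assumes exact overlaps and no fragment contained in another; the three fragments are taken from the "ideal" example, with the contained fragment ACCGT left out.

```python
def overlap(a: str, b: str) -> int:
    """Largest t such that the last t symbols of a equal the first t symbols of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def spell_path(path: list[str]) -> tuple[str, int]:
    """Return (S(P), w(P)) for a path given as an ordered list of fragments."""
    sequence = path[0]
    weight = 0
    for prev, nxt in zip(path, path[1:]):
        t = overlap(prev, nxt)
        weight += t
        sequence += nxt[t:]  # append only the symbols not already covered
    return sequence, weight


if __name__ == "__main__":
    path = ["TTAC", "TACCGT", "CGTGC"]
    s, w = spell_path(path)
    total = sum(len(f) for f in path)
    print(s, len(s), total - w)
```

Since ||A|| is fixed for a Hamiltonian path, maximizing w(P) is the same as minimizing the length of the resulting sequence.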
The greedy algorithm
• Goal: find a Hamiltonian path with large
• Heuristic: iteratively find the heaviest edge
and try to add it to the path:
• Acceptance test: An edge can be added to
the path if it will not create a branching point
on the path.
sort edges by weight
for each edge (f,g) in decreasing order of weight
    perform acceptance test for (f,g)
    if accepted, add it to the path
Try: (a,d) – ok, selected
Try: (d,b) – ok, selected
Try: (a,b) – acceptance test false
Try: (b,c) – ok, selected
From Setubal/Meidanis book
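The greedy procedure can be sketched in Python as follows. The acceptance test rejects an edge (f,g) if f already has a selected outgoing edge or g already has a selected incoming edge (a branching point). This sketch additionally rejects edges that would close a cycle, using a union-find structure; that check is an assumption added here so that the result is always a set of simple paths. The edge weights in the usage example are hypothetical, chosen only to reproduce the order of the trace above.

```python
def greedy_paths(edges: dict[tuple[str, str], int]) -> list[list[str]]:
    """Greedily select heavy edges and return the resulting simple paths."""
    vertices = sorted({v for edge in edges for v in edge})
    parent = {v: v for v in vertices}  # union-find forest for cycle detection

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    successor: dict[str, str] = {}
    has_incoming: set[str] = set()
    # Process edges from heaviest to lightest.
    for (f, g), _ in sorted(edges.items(), key=lambda item: -item[1]):
        if f in successor or g in has_incoming:
            continue  # would create a branching point
        if find(f) == find(g):
            continue  # would close a cycle (extra check, see lead-in)
        successor[f] = g
        has_incoming.add(g)
        parent[find(f)] = find(g)

    # Each vertex without an incoming edge starts one path.
    paths = []
    for start in vertices:
        if start not in has_incoming:
            path = [start]
            while path[-1] in successor:
                path.append(successor[path[-1]])
            paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical weights; only their relative order matters here.
    example = {("a", "d"): 5, ("d", "b"): 4, ("a", "b"): 3, ("b", "c"): 2}
    print(greedy_paths(example))
```

If the heuristic does not find a Hamiltonian path, the function returns more than one path. The greedy choice is not guaranteed to give the heaviest Hamiltonian path.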
Complication - repeated regions
Repeated regions: sequences that appear more than once in the molecule. The
copies of a repeat do not need to be exactly the same. Problems are illustrated
From Setubal/Meidanis book
Coverage and linkage
• coverage = number of times a given position is
included in an aligned fragment.
• if the coverage equals 0 at some column – we do not
have a continuous layout.
• linkage = amount of overlap between fragments:
From Setubal/Meidanis book
Complication – lack of coverage
• Coverage at position i of the target is the
number of fragments that cover this
• A contig – a continuously covered region.
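Coverage and contigs can be computed directly from a layout. The sketch below assumes each fragment is given as a half-open interval (start, end) of 0-based positions on the target; this representation and the function names are illustrative choices, not part of the lecture.

```python
def coverage(layout: list[tuple[int, int]], target_len: int) -> list[int]:
    """Number of fragments covering each position of the target."""
    # Difference array: +1 where a fragment starts, -1 just past its end.
    delta = [0] * (target_len + 1)
    for start, end in layout:
        delta[start] += 1
        delta[end] -= 1
    cov, running = [], 0
    for i in range(target_len):
        running += delta[i]
        cov.append(running)
    return cov


def contigs(cov: list[int]) -> list[tuple[int, int]]:
    """Maximal runs of positions with coverage > 0, as half-open intervals."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs


if __name__ == "__main__":
    # Layout of the "ideal" example: ACCGT, CGTGC, TTAC, TACCGT on TTACCGTGC.
    ideal = [(2, 7), (4, 9), (0, 4), (1, 7)]
    cov = coverage(ideal, 9)
    print(cov, contigs(cov))
```

A position with coverage 0 splits the layout into separate contigs. For example, the layout [(0, 3), (5, 8)] on a target of length 8 yields two contigs.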
• sequence walking (direct sequencing)
- derive a primer from a sequence near the end of a contig
- replicate the sequence starting at the primer
- sequence the replicated sequence
- if the replicated sequence did not cover the gap, repeat the
- Problems: tedious for larger gaps, region of interest must be
unique in the genome
• dual end sequencing. Recall that the inserts are much longer
than the sequenced fragments. If we sequence both ends of the
insert, we obtain mate pairs, which can be used as follows:
if the two ends of a mate pair are in two different contigs, we can
deduce the orientation of and distance between the two contigs.
Scaffold – a sequence of contigs where the order and distances
between the contigs are approximately known.
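The distance between two contigs can be estimated from mate pairs with simple arithmetic. The sketch below rests on several simplifying assumptions that are not from the lecture: the insert length is known, contig A precedes contig B, both contigs are in the same orientation, one read of the pair starts at offset start_in_a in A, and the insert ends at offset end_in_b in B (0-based, exclusive). The insert then spans (len_a - start_in_a) bases of A, the gap, and end_in_b bases of B.

```python
from statistics import mean


def gap_estimate(insert_len: int, len_a: int, start_in_a: int, end_in_b: int) -> int:
    """Gap between contigs A and B implied by one mate pair.

    insert_len = (len_a - start_in_a) + gap + end_in_b, solved for gap.
    """
    return insert_len - (len_a - start_in_a) - end_in_b


def scaffold_gap(mate_pairs: list[tuple[int, int]], len_a: int, insert_len: int) -> float:
    """Average the gap estimates over several mate pairs linking A and B."""
    return mean(gap_estimate(insert_len, len_a, s, e) for s, e in mate_pairs)


if __name__ == "__main__":
    # Hypothetical numbers: contig A has length 1000, inserts are about 2000 bp.
    print(gap_estimate(2000, 1000, 800, 300))
    print(scaffold_gap([(800, 300), (900, 420)], len_a=1000, insert_len=2000))
```

Because insert lengths are only approximately known, averaging over several mate pairs gives a more stable estimate than any single pair.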
What do we learn from whole genome
• Using gene finding algorithms we can
discover a significant portion of genes
• Understand the structure of a genome
• Understand genome evolution
• Searching for genes associated with
• Gene duplication – widely accepted method for
creation of new genes
• Ohno proposes that whole genome duplication
(polyploidization) provides material for new
• 2R Hypothesis: two rounds of polyploidization
followed by gene loss and functional divergence
occurred early in vertebrate lineage.
Results filtered to report segments at least 1000 bp, at least 59%
NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com 801
In comparative genome analysis synteny blocks = regions
containing the homologous genes
Below: Segmental duplications in the Arabidopsis genome
found using the program MUMmer.
How many rounds of genome
• Two rounds of genome duplication
should lead to occurrences of groups
of four synteny blocks
• Such a tree should then be observed in
the current genome
• They should be consistent
• For vertebrate evolution there is
evidence for full genome duplication
A B C D
Whole genome duplications in yeast
• Find synteny blocks
• Find overlaps in synteny blocks
• Use duplicate synteny blocks to define “sister”
regions in S. cerevisiae (145 sister regions covering
88% of the genome)
Some lessons from whole genome
alignment of closely related species
Neutral evolution/natural selection
• natural selection: a process by which biological populations are
altered over time, as a result of the propagation of heritable traits
that affect the capacity of individual organisms to survive.
– responsible for organisms being adapted to their environment.
– The theory of natural selection was proposed by Charles Darwin and Alfred
Russel Wallace in 1858, though vaguer and more obscure formulations had
been arrived at by earlier researchers.
• neutral theory of evolution (Kimura 1968):
– vast majority of molecular differences are selectively neutral.
– these genome features are neither subject to, nor explicable by, natural
selection.
– most evolutionary change is the result of genetic drift acting on neutral alleles.
Through drift, these new alleles may become more common within the
population. They may subsequently decline and disappear, or in rare cases
they may become fixed--meaning that the substitution they carry becomes a
universal feature of the population or species
• The neutralist-selectionist debate – which is the prevalent evolutionary force?
Comparative Genome analysis tools
Assume two closely related organisms (close enough, for this
purpose, that back substitutions A→X→A are
unlikely; examples: mouse/rat, human/chimpanzee)
KA – # of coding base substitutions that result in an amino-
acid change (non-synonymous substitution rate)
KS – # of coding base substitutions that do not result in an
amino-acid change (synonymous substitution rate)
KA/KS – measure of evolutionary constraints
KA/KS > 1: possible adaptive or positive selection
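The classification of substitutions into the two classes can be illustrated with a short sketch. It makes strong simplifying assumptions: the two coding sequences are already aligned without gaps and in frame, each differing base is judged on its own against the first sequence's codon, and the result is a raw count rather than a rate. Practical KA/KS estimators also normalize by the number of synonymous and non-synonymous sites and correct for multiple substitutions, which this sketch does not attempt. The standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_substitutions(seq1: str, seq2: str) -> tuple[int, int]:
    """Return (non-synonymous, synonymous) substitution counts between two
    aligned, in-frame coding sequences of equal length."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    nonsyn = syn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        for k in range(3):
            if c1[k] != c2[k]:
                # Apply this single base change to the first codon and compare
                # the encoded amino acids.
                mutated = c1[:k] + c2[k] + c1[k + 1:]
                if CODON_TABLE[mutated] == CODON_TABLE[c1]:
                    syn += 1
                else:
                    nonsyn += 1
    return nonsyn, syn


if __name__ == "__main__":
    # GCT -> GCC keeps alanine (synonymous); AAA -> AGA changes lysine to arginine.
    print(count_substitutions("ATGGCTAAA", "ATGGCCAGA"))
```

With real data, the ratio of the two normalized rates would then be compared against 1, as described above.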
KA/KS ratio
Comparison mouse/rat human/chimpanzee
Initial sequence of the chimpanzee genome and comparison with Human
genome, The Chimpanzee Genome Seque