Whole Genome Shotgun Assembly

3. Lecture WS 2003/04

Bioinformatics III


Two strategies for sequencing:

- clone-by-clone approach
- whole-genome shotgun approach (Celera, Gene Myers)

Shotgun sequencing was introduced by F. Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years now.

ED Green, Nat Rev Genet 2, 573 (2001)



Automatic sequencing

[Figure: shotgun sequencing workflow. Whole genome or BAC/cosmid clone -> DNA fragmentation (sonic disruption, nebulization) -> small fragments (1.0 - 2.0 kb) -> clone library (pUC18) -> DNA sequencing (random clones) -> partial assembly (contigs) -> finishing (quality, both-strands coverage, gap filling) -> final consensus sequence.]



Automated Sequencing

Nearly all automatic sequencing is done using the enzymatic dideoxy chain-termination method of Sanger (1977).

Separation of fragments by gel electrophoresis.

Readout of fragments labeled with fluorescent dyes.

Computer analysis of gel images:

- lane tracking – identify gel boundaries
- lane profiling – sum each of 4 signals across lane width to create a profile
- trace processing – deconvolute and smooth signal estimates + reduce noise
- base-calling – translate the processed trace into a sequence of bases

The program Phred is the quasi-standard for the last step (base calling).



Base Calling - Phred

B. Ewing, L. Hillier, M.C. Wendl, P. Green. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8, 175-185 (1998).

B. Ewing, P. Green. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8, 186-194 (1998).

The processed traces are displayed as chromatograms of 4 curves of different color, each curve representing the signal of 1 of the 4 bases.



Base Calling - Phred

Idealized traces would consist of evenly spaced, nonoverlapping peaks.

Real traces deviate from this ideal due to imperfections of the sequencing reactions, of gel electrophoresis, and of trace processing.

The first 50 or so peaks and peaks over 500 or so are particularly noisy.

Quality:

high – no ambiguities

medium – some ambiguities

poor – low confidence



Base Calling Algorithm

1 Locate Predicted Peaks

find the idealized locations of the base peaks using Fourier methods.

2 Locate Observed Peaks

scan 4 trace arrays for concave regions satisfying

2 v(i) ≥ v(i+1) + v(i-1)

3 Match Observed and Predicted Peaks

a) find easy matches

b) use dynamic programming to align those peaks not matched in a)

c) match remaining observed peaks that seem to represent genuine bases

4 Find missed Peaks
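Step 2 above can be sketched as follows, assuming the concavity condition is 2 v(i) ≥ v(i+1) + v(i-1) and that a trace is a plain list of signal values. The function name `find_concave_regions` is hypothetical and this is not Phred's actual code. Note that a perfectly flat baseline also satisfies the inequality, so a real base caller presumably needs further checks such as a minimum peak area.

```python
def find_concave_regions(v):
    """Return inclusive (start, end) index pairs of maximal runs of
    positions i with 2*v[i] >= v[i+1] + v[i-1] (illustrative sketch)."""
    regions = []
    start = None
    for i in range(1, len(v) - 1):
        concave = 2 * v[i] >= v[i + 1] + v[i - 1]
        if concave and start is None:
            start = i                      # a concave run begins here
        elif not concave and start is not None:
            regions.append((start, i - 1)) # the run ended at i - 1
            start = None
    if start is not None:                  # run reaches the end of the trace
        regions.append((start, len(v) - 2))
    return regions


# Toy trace with two peaks (made-up numbers, one of the 4 trace arrays).
trace = [0, 1, 4, 9, 10, 9, 4, 1, 0, 0, 1, 4, 9, 10, 9, 4, 1, 0]
print(find_concave_regions(trace))
```

Each returned region is a candidate observed peak that would then be matched against the predicted peak locations in step 3.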

Phred quality values

q = - 10 log10 (p)

where

q - quality value

p - estimated error probability for a base call

Examples:

q = 20 means p = 10^-2 (1 error in 100 bases)

q = 40 means p = 10^-4 (1 error in 10,000 bases)
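The conversion between quality values and error probabilities, q = -10 log10(p), can be checked with a few lines of Python. The function names are hypothetical helpers, not part of Phred.

```python
import math


def phred_quality(p):
    """Quality value q for an estimated error probability p."""
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse mapping: error probability p for a quality value q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

Summing `error_probability(q)` over all bases of a read gives the expected number of miscalled bases in that read.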

Phred

Phred performs several tasks:

a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.

b. Calls bases – attributes a base for each identified peak with a lower error rate than the standard base calling programs.

c. Assigns quality values to the bases – a "Phred value" based on an error rate estimation calculated for each individual base.

d. Creates output files – base calls and quality values are written to output files.



Whole genome assembly: problem description

The goal is to reconstruct an unknown source sequence (the genome) on {A, C, G, T} given many random short segments from the sequence, the shotgun reads.

A read is a subsequence of nucleotides of length around 500, taken from a random place in the genome.

The orientation of the read is either forward or reverse complement.

Reads contain two kinds of errors: base substitutions and indels.

Base substitutions occur with a frequency of ca. 0.5 – 2%.

Indels occur roughly 10 times less frequently.
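The read model described above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the default rates (1% substitutions, 0.1% indels) are picked from the ranges quoted above, insertions and deletions are taken to be equally likely, and `sample_read` is a hypothetical name.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def sample_read(genome, rng, length=500, sub_rate=0.01, indel_rate=0.001):
    """Draw one noisy read from a random place and random strand."""
    start = rng.randrange(len(genome) - length + 1)
    fragment = genome[start:start + length]
    if rng.random() < 0.5:                     # random orientation
        fragment = reverse_complement(fragment)
    out = []
    for base in fragment:
        r = rng.random()
        if r < sub_rate:                       # base substitution
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate / 2:    # deletion (assumed half of indels)
            continue
        elif r < sub_rate + indel_rate:        # insertion after this base
            out.append(base)
            out.append(rng.choice("ACGT"))
        else:                                  # base read correctly
            out.append(base)
    return "".join(out)


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = [sample_read(genome, rng) for _ in range(10)]
print(len(reads), [len(r) for r in reads][:3])
```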

Reads can come from short plasmid inserts (2-12 kb), cosmids (40 kb) or BACs (150 kb).

Batzoglou PhD thesis (2002)



Whole Genome Assemblers

TIGR Assembler G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995)

PHRAP P. Green (1996)

Celera Assembler

CAP3 X. Huang, A. Madan, Genome Res 9, 868-877 (1999)

RePS J. Wang et al. Genome Res 12, 824-831 (2002)

Phusion (Sanger) J.C. Mullikin, Z. Ning, Genome Res 13, 81-90 (2003)

Arachne (Whitehead/MIT)

Euler (UCSD, USC) P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001)

Most assemblers follow the same approach:

overlap – layout – consensus



CAP3 Assembler

Removal of poor end regions of reads

Computation of overlaps between reads

Removal of false overlaps

Construction of contigs

Construction of multiple sequence alignments and generation of consensus sequences



CAP3: Clipping of Low-Quality Regions

Use base quality values (from Phred) and sequence similarities to compute the 5' and 3' clipping positions of reads.

Definition of good regions of a read:

- any sufficiently long region of high quality values that is similar to a region of another read, OR
- any sufficiently long region that is highly similar to a good high-quality region of another read
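As an illustration of the quality-based half of this idea only, the sketch below picks the contiguous region that maximizes the sum of (q - cutoff). This is a generic maximum-scoring-segment trim, not CAP3's actual procedure, and it ignores the similarity criterion entirely. The cutoff of 20 and the name `clip_positions` are assumptions.

```python
def clip_positions(quals, cutoff=20):
    """Return the half-open interval (start, end) of the segment that
    maximizes sum(q - cutoff); (0, 0) if no base exceeds the cutoff."""
    best_score, best = 0, (0, 0)
    score, start = 0, 0
    for i, q in enumerate(quals):
        if score <= 0:              # restart the candidate segment here
            score, start = 0, i
        score += q - cutoff
        if score > best_score:      # best segment so far ends at i
            best_score, best = score, (start, i + 1)
    return best


# Made-up Phred values: poor ends, good middle with one weak base.
quals = [5, 8, 10, 30, 35, 40, 38, 12, 36, 40, 9, 6, 4]
print(clip_positions(quals))
```

The single weak base inside the good region is tolerated because the surrounding high-quality bases outweigh it.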

Computation of the 5' and 3' clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high quality region of a read.

Huang, Madan, Genome Res 9, 868 (1999)



Celera – compartmentalized shotgun assembler

use preliminary data from both

human genome assembly projects

Huson et al. Bioinformatics 17, S132 (2001)



Arachne program

by Serafim Batzoglou (MIT, PhD thesis 2000)

(i) create graph G of overlaps between pairs of reads of shotgun data

(ii) process G for the purpose of constructing supercontigs of mapped reads.

Batzoglou et al. Genome Res 12, 177 (2002)



Earmuff links

An important variation of whole-genome shotgun sequencing obtains

reads from both ends of an insert, forward and backward.

Since inserts are size-selected, the approximate distance of the pair

of reads obtained from the ends of a fragment is known.

These will be called earmuff links.



Arachne: creation of overlap graph

List of reads R = (r1, ..., rN) , N is number of reads.

Each read ri has length li < 1000.

If two reads are taken from the two ends of the same clone (earmuff link), ri has a link to the other read rj at a specified distance dij.

First: create graph G of overlaps (edges) between pairs of reads (nodes).

Pairs of reads in R need to be aligned.

Since R can be very long, N^2 alignments are infeasible.

Create a table of occurrences of k-mers (strings of length k) in the reads and count the number of k-mer matches for each pair of reads.

Then perform pairwise alignments only between pairs of reads that share more than a cutoff number of k-mers.




Arachne: table of k-mer occurrences

Find the number of k-mer matches in the forward or reverse complement direction between each pair of reads in R.

(1) Obtain all triplets (r, t, v):

r = read in R

t = index of a k-mer occurring in r

v = direction of occurrence (forward or reverse complement)

(2) Sort the set of triplets according to the k-mer indices t.

(3) Use the sorted list to create a table T of quadruplets (ri, rj, f, v), where ri and rj are reads that contain at least one common k-mer, v is a direction, and f is the number of k-mers in common between ri and rj in direction v.
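The three steps can be sketched in Python as follows. This is a simplified illustration, not Arachne's implementation: each k-mer is stored in a canonical form (the lexicographically smaller of the k-mer and its reverse complement) as a stand-in for the k-mer index t, distinct k-mers are counted rather than occurrences, and palindromic k-mers are always reported as forward matches. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, groupby

_COMP = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMP)[::-1]


def kmer_triplets(reads, k):
    """Step 1: triplets (read index, canonical k-mer, direction)."""
    triplets = set()
    for r, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                triplets.add((r, kmer, "+"))
            else:
                triplets.add((r, rc, "-"))
    return triplets


def kmer_table(reads, k):
    """Steps 2 and 3: sort by k-mer, then count shared k-mers per pair."""
    ordered = sorted(kmer_triplets(reads, k), key=lambda x: x[1])
    counts = Counter()
    for _, group in groupby(ordered, key=lambda x: x[1]):
        occurrences = sorted((r, v) for r, _, v in group)
        for (ri, vi), (rj, vj) in combinations(occurrences, 2):
            if ri != rj:
                direction = "forward" if vi == vj else "reverse"
                counts[(ri, rj, direction)] += 1
    return sorted((ri, rj, f, v) for (ri, rj, v), f in counts.items())


# Read 2 is the reverse complement of read 1; reads 0 and 1 overlap.
reads = ["ACGTTGCA", "GTTGCAGG", "CCTGCAAC"]
for row in kmer_table(reads, 4):
    print(row)
```

Pairs whose count f exceeds the cutoff would then be passed on to the pairwise alignment stage.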






Here: k = 3




If a k-mer occurs "too often", it is likely part of a repeat sequence and should not be used for detecting overlaps.

Implementation

(1) Find the k-mer occurrences (r, t, v) and sort them into 64 files according to the first three nucleotides of each k-mer.

(2) For i = 1, ..., 64: load file i into memory, sort it according to t, and store the sorted file.

(3) Load the 64 sorted files into memory sequentially and create table T incrementally.

In practice, k = 8 to 24.
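The 64-file scheme is a bucket sort on the first three nucleotides (4^3 = 64 prefixes). A minimal in-memory sketch, with Python lists standing in for the files and k-mers given as strings (both assumptions; the function name is hypothetical):

```python
from itertools import product

# The 64 possible 3-nucleotide prefixes, in lexicographic order.
PREFIXES = ["".join(p) for p in product("ACGT", repeat=3)]


def sort_triplets_by_bucket(triplets):
    """Yield (r, t, v) triplets sorted by k-mer t, one bucket at a time."""
    buckets = {p: [] for p in PREFIXES}        # stand-ins for the 64 files
    for r, t, v in triplets:                   # (1) distribute by prefix
        buckets[t[:3]].append((r, t, v))
    for prefix in PREFIXES:                    # (2) sort each bucket alone
        for triplet in sorted(buckets[prefix], key=lambda x: x[1]):
            yield triplet                      # (3) consume sequentially


triplets = [(0, "TTGCA", "+"), (1, "ACGTA", "-"), (2, "ACGAA", "+"), (0, "GGGTT", "+")]
print(list(sort_triplets_by_bucket(triplets)))
```

Only one bucket has to be sorted at a time, which is the point of the scheme when the full list does not fit in memory.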




Arachne: pairwise read alignments

Perform pairwise alignments between reads that contain more than a cutoff number of common k-mers. When k-mers that are too common (more frequent than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise alignments will be performed.

Only a small number of base substitutions and indels is allowed in an

overlapping region of two aligned reads.

Use dynamic programming alignment that disallows deviations of more than

a few characters.
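One common way to disallow large deviations is a banded dynamic program, which fills only the cells within a fixed distance of the main diagonal. The sketch below computes a banded edit distance between two candidate overlap regions; it is a generic illustration under that assumption, not Arachne's alignment routine, and the band width of 3 is arbitrary.

```python
def banded_edit_distance(a, b, band=3):
    """Edit distance using only cells with |i - j| <= band.
    Returns None if the lengths alone already differ by more than band."""
    if abs(len(a) - len(b)) > band:
        return None
    inf = float("inf")
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                best = i                                   # i deletions
            else:
                best = inf
                if (j - 1) in prev:                        # match / mismatch
                    best = prev[j - 1] + (a[i - 1] != b[j - 1])
                if j in prev:                              # gap in b
                    best = min(best, prev[j] + 1)
                if (j - 1) in cur:                         # gap in a
                    best = min(best, cur[j - 1] + 1)
            cur[j] = best
        prev = cur
    return prev[len(b)]


print(banded_edit_distance("ACGTACGT", "ACGAACGT"))
print(banded_edit_distance("ACGTACGT", "ACGACGT"))
```

The cost is O(band x length) instead of O(length^2), which matters when O(N) read pairs have to be aligned.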

Output of the alignment algorithm:

for reads ri, rj, a quadruplet (b1, b2, e1, e2) of the beginning positions b1, b2 and end positions e1, e2 of the detected overlap region.

If a significant overlap region is detected, (ri, rj, b1, b2, e1, e2) becomes a link in the overlap graph G.




Correcting errors in reads


Shown is a portion of a multiple alignment between 5 reads. A base T of quality 30 is aligned to bases C, some of which are of quality greater than 30. The base T is subsequently changed to a base C of quality 30.



Partial alignments

Three partial alignments of length k=6 between a pair of reads coalesce to yield a single full alignment of length 19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads.




Ambiguity created by the presence of repeats

In the absence of sequencing errors and repeats, it would be simple to retrieve all retrievable pairwise distances of reads and to construct G.

In the presence of repeats, a link between two reads in G does not necessarily imply a true overlap. A "repeat link" is a link in G between two reads that come from different regions of the genome and overlap in a repeated segment.




Arachne: processing of overlap graph

Some of the repetition in the genome is efficiently masked before the creation

of G by throwing away k-mers of high frequency when building T.

Furthermore some heuristic algorithms are used to detect and delete

repetitive links (not discussed here).




Merging contigs


Sequence contigs are formed by merging together pairs of reads that can be

merged without ambiguity.

In practice the situation is much worse than shown here. Repeats are not 100% conserved between copies.



Sequence contigs




Using paired pairs of overlaps to merge reads

Arachne searches for instances of two plasmids of similar insert size with sequence overlaps occurring at both ends (paired pairs).


(A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths.

(B) Initially, the top two pairs of reads are merged. Then the third pair of reads is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged.

Bottom: collections of paired pairs are merged into contigs, and consensus sequences are formed.



Detection of repeat contigs

Contig R is linked to contigs A and B to the right. The distances estimated between R and A and between R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B, then R is probably a repeat linking to two unique regions to the right.
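The test described here amounts to an interval computation. In the sketch below, positions are measured from the right end of R, so A occupies [d(R,A), d(R,A) + len(A)] and B likewise. The threshold `min_overlap` and both function names are assumptions for illustration only.

```python
def implied_overlap(dist_ra, len_a, dist_rb, len_b):
    """Overlap (in bases) that the link distances force between A and B."""
    start_a, end_a = dist_ra, dist_ra + len_a
    start_b, end_b = dist_rb, dist_rb + len_b
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def looks_like_repeat(dist_ra, len_a, dist_rb, len_b,
                      overlap_detected, min_overlap=500):
    """R is suspicious if A and B must overlap substantially according to
    the links, yet no such overlap was found between their sequences."""
    forced = implied_overlap(dist_ra, len_a, dist_rb, len_b)
    return forced >= min_overlap and not overlap_detected


# A would lie at 1000..6000 and B at 1500..5500 to the right of R.
print(implied_overlap(1000, 5000, 1500, 4000))
print(looks_like_repeat(1000, 5000, 1500, 4000, overlap_detected=False))
```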


Some of the identified contigs are repeat contigs, in which nearly identical sequences from distinct regions are collapsed together. They can be detected because

(a) repeat contigs usually have an unusually high depth of coverage, and

(b) they typically have conflicting links to other contigs.

After marking repeat contigs, the remaining contigs should represent the correctly assembled sequence.



Supercontig creation and gap filling

(A) A supercontig is constructed by successively

linking pairs of contigs that share at least two

forward-reverse links. Here, 3 contigs are

joined into one supercontig.
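The grouping rule in (A) can be sketched with a union-find structure: contigs joined by at least two forward-reverse links end up in the same supercontig. This sketch only groups contigs; it ignores their order, orientation, and the distance consistency checks a real assembler needs. The function name and input format are assumptions.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """Group contigs connected by >= min_links forward-reverse links.
    `links` is a list of (contig, contig) pairs, one per linking read pair."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    link_counts = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in link_counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)       # merge the two groups

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c3", "c4")]
print(build_supercontigs(contigs, links))
```

Here c4 stays on its own because it shares only a single link with c3.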

The layout now consists of a number

of supercontigs with interleaved gaps.

Most gaps belong to regions marked

as repeat contigs, some correspond

to regions of insufficient shotgun reads.

(B) Arachne attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs.

Batzoglou et al. Genome Res 12, 177 (2002)

Unmarked contigs = unique contigs.

Iteratively merge contigs into supercontigs.



Contig assembly

If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that shift(b,c) = shift(a,c) - shift(a,b).

A repeat boundary is detected toward the right of read a if there is no overlap (b,c), nor any path of reads x1, ..., xk such that (b,x1), (x1,x2), ..., (xk,c) are all overlaps and shift(b,x1) + ... + shift(xk,c) ≈ shift(a,c) - shift(a,b).
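A sketch of this test, assuming the detected overlaps are given as a dictionary mapping an ordered read pair (x, y), with x to the left of y, to its shift, and assuming "approximately equal" means within a tolerance `tol`. The tolerance, the bound on path length, and the function name are all hypothetical.

```python
from collections import deque


def repeat_boundary(a, b, c, shift, tol=5, max_steps=10):
    """True if no overlap (b, c) and no path of overlaps from b to c
    explains the shift expected from (a, b) and (a, c)."""
    if (b, c) in shift:                       # direct overlap exists
        return False
    expected = shift[(a, c)] - shift[(a, b)]
    queue = deque([(b, 0, 0)])                # (read, summed shift, steps)
    while queue:
        node, total, steps = queue.popleft()
        if node == c and abs(total - expected) <= tol:
            return False                      # a consistent path was found
        if steps == max_steps:
            continue
        for (x, y), s in shift.items():
            if x == node and y != a:
                queue.append((y, total + s, steps + 1))
    return True


overlaps = {("a", "b"): 100, ("a", "c"): 250, ("b", "x"): 80, ("x", "c"): 70}
print(repeat_boundary("a", "b", "c", overlaps))
```

In the example the path b, x, c sums to 150 = 250 - 100, so no repeat boundary is reported.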




Consistency of forward-reverse links

(A) The distance d(A,B) (length of gap or negated length of overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them.

(B) The distance d(B,C) between two contigs B, C that are linked to the same contig A can be estimated from their respective distances to the linked contig.
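Both estimates reduce to simple arithmetic. The sketch below assumes that A lies to the left of B, that each link is described by the start of the forward read in A and the far end of the reverse read in B (both measured from the left end of their contig), and that the mean insert size is known. For (B) it additionally assumes that B and C both lie to the right of A with B first. These conventions and the function names are illustrative assumptions.

```python
from statistics import mean


def estimate_gap(len_a, links, insert_size):
    """(A) d(A, B) averaged over links. Each insert covers
    (len_a - pos_a) bases of A, the gap, and end_b bases of B."""
    estimates = [insert_size - (len_a - pos_a) - end_b
                 for pos_a, end_b in links]
    return mean(estimates)


def distance_between(d_ab, len_b, d_ac):
    """(B) d(B, C) from the distances of B and C to the shared contig A.
    A negative value means B and C are expected to overlap."""
    return d_ac - d_ab - len_b


# Two links from a 10 kb contig A into B, insert size 2 kb (made-up numbers).
print(estimate_gap(10000, [(9000, 800), (9500, 1400)], 2000))
print(distance_between(150, 3000, 4000))
```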




Types of misassemblies

(A) Three types of simple minor misassemblies are shown: insertions, deletions, and hanging ends. In all cases, a contiguous segment (of a contig or the genome) of less than 10 kb does not align in the expected location (with the genome or contig).

(B) More misassemblies. First, two pieces of a contig align to distant parts of the genome. Second, adjacent contigs in a supercontig are aligned to distant parts of the genome.




Filling gaps in supercontigs

(A) Contigs A and B are connected by a path p of contigs X1, ..., Xk. The distance dp(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A and B.

(B) Contigs Y1 and Y2 share forward-reverse links with the supercontig S. These links position them in the vicinity of the gap between A and B. Therefore, Y1 and Y2 will be used as possible stepping points in the path closing the gap from A to B.




Detection of chimeric reads

Reads l1, l2, l3, r1, r2, and r3, and the absence of a read n (having long overlaps on

both sides of a point x) suggest that read c may be chimeric, consisting of the

juxtaposition of two disparate genomic segments: one corresponding to the part

of c before x, and one corresponding to the part of c after x.

Note that reads l3 and r3 extend slightly beyond x, as often happens for real

chimeric reads.




Contig Coverage and Read Usage




Characterization of Contigs




Characterization of Supercontigs




Base Pair Accuracy




Misassemblies




Computational Performance








Comparison of different assemblers

Pevzner, Tang, Waterman PNAS 98, 9748 (2001)

You should look out for:

- smallest number of contigs + misassembled contigs
- highest possible coverage by contigs
- lowest possible coverage by misassembled contigs



There is no error-free assembler to date

Pevzner, Tang, Waterman PNAS 98, 9748 (2001)

Comparative analysis of EULER, PHRAP, CAP, and TIGR assemblers (NM sequencing project). Every box corresponds to a contig in NM assembly produced by these programs with colored boxes corresponding to assembly errors. Boxes in the IDEAL assembly correspond to islands in the read coverage. Boxes of the same color show misassembled contigs. Repeats with similarity higher than 95% are indicated by numbered boxes at the solid line showing the genome. To check the accuracy of the assembled contigs, we fit each assembled contig into the genomic sequence. Inability to fit a contig into the genomic sequence indicates that the contig is misassembled. For example, PHRAP misassembles 17 contigs in the NM sequencing project, each contig containing from two to four fragments from different parts of the genome.

"Biologists 'pay' for these errors at the time-consuming finishing step."



What comes next? Finishing the genome

Usually, the assembly of shotgun data is finished with a number of contigs

with some remaining gaps.

Also, within each contig there are some regions of high error rate.

The goal of the finishing phase is then to get a single continuous contig

with low error rate.

"Finishers" apply ad hoc rules to decide where additional data is necessary.

This experimental data may then be generated in experiments using

different chemistry or higher coverage.

Autofinish (phrap group) is a program to help humans with deciding

which new reads to get.



Human experts are only rarely needed ...

D. Gordon, C. Desmarais, P. Green, Genome Res, 11, 614 (2001)
