Whole Genome Shotgun Assembly



  • Whole Genome Shotgun Assembly. Two strategies for sequencing: the clone-by-clone approach and the whole-genome shotgun approach (Celera, Gene Myers).

    Shotgun sequencing was introduced by F. Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years. E.D. Green, Nat Rev Genet 2, 573 (2001)

  • Automatic sequencing

  • Automated Sequencing. Nearly all automated sequencing is done using the enzymatic dideoxy chain-termination method of Sanger (1977). Fragments are separated by gel electrophoresis and read out via fluorescent dye labels.

    Computer analysis of gel images:
    lane tracking: identify the gel lane boundaries
    lane profiling: sum each of the 4 signals across the lane width to create a profile
    trace processing: deconvolute and smooth the signal estimates and reduce noise
    base calling: translate the processed trace into a sequence of bases

    The program Phred is the quasi-standard for the last step (base calling).

  • Base Calling - Phred. B. Ewing, L. Hillier, M.C. Wendl, P. Green. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8, 175-185 (1998). B. Ewing, P. Green. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8, 186-194 (1998). The processed traces are displayed as chromatograms of 4 curves of different color, each curve representing the signal of 1 of the 4 bases.

  • Base Calling - Phred. Idealized traces would consist of evenly spaced, nonoverlapping peaks.

    Real traces deviate from this ideal due to imperfections of the sequencing reactions, of gel electrophoresis, and of trace processing.

    The first 50 or so peaks and peaks beyond 500 or so are particularly noisy.

    Quality (figure labels): high = no ambiguities; medium = some ambiguities; poor = low confidence.

  • Base Calling Algorithm
    1. Locate predicted peaks: find the idealized locations of the base peaks using Fourier methods.
    2. Locate observed peaks: scan the 4 trace arrays for concave regions satisfying 2 v(i) ≥ v(i+1) + v(i-1).
    3. Match observed and predicted peaks:
       a) find easy matches
       b) use dynamic programming to align those peaks not matched in a)
       c) match remaining observed peaks that seem to represent genuine bases
    4. Find missed peaks.
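
    Step 2 (locating observed peaks) can be illustrated with a small sketch. It assumes a single trace is a plain list of signal values and that the concavity test is 2 v(i) ≥ v(i+1) + v(i-1); the function name concave_regions is hypothetical, and this is not Phred's actual code.

```python
def concave_regions(v):
    """Return (start, end) index pairs (inclusive) of maximal runs of
    interior points i that satisfy 2*v[i] >= v[i+1] + v[i-1]."""
    regions = []
    start = None
    for i in range(1, len(v) - 1):
        if 2 * v[i] >= v[i + 1] + v[i - 1]:
            if start is None:
                start = i          # a new concave run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:          # run extends to the last interior point
        regions.append((start, len(v) - 2))
    return regions


trace = [0, 1, 4, 9, 10, 9, 4, 1, 0]
print(concave_regions(trace))
```

    With this test a flat baseline also counts as concave, so a practical caller would need further filtering, which the sketch omits.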

  • Phred quality values: q = -10 log10(p)

    where q is the quality value and p is the estimated error probability for a base call.

    q = 20 means p = 10^-2 (1 error in 100 bases)
    q = 40 means p = 10^-4 (1 error in 10,000 bases)
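
    The conversion between quality values and error probabilities can be checked with a few lines of Python. The function names are illustrative only; they are not part of Phred.

```python
import math


def phred_quality(p):
    """Quality value q for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse mapping: estimated error probability for quality value q."""
    return 10 ** (-q / 10)


for q in (20, 40):
    print(q, error_probability(q))
```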

  • Phred performs several tasks:
    a. Reads trace files: compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
    b. Calls bases: attributes a base to each identified peak, with a lower error rate than the standard base-calling programs.
    c. Assigns quality values to the bases: a Phred value based on an error rate estimate calculated for each individual base.
    d. Creates output files: base calls and quality values are written to output files.

  • Whole genome assembly: problem description. The goal is to reconstruct an unknown source sequence (the genome) over the alphabet {A, C, G, T}, given many random short segments from the sequence, the shotgun reads.

    A read is a subsequence of nucleotides of length around 500, taken from a random place in the genome. The orientation of the read is either forward or reverse complement.

    Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur with a frequency of ca. 0.5-2%. Indels occur roughly 10 times less frequently.

    Reads can come from short plasmid inserts (2-12 kb), cosmids (40 kb) or BACs (150 kb). Batzoglou PhD thesis (2002)
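
    To make the problem statement concrete, here is a toy read sampler. It is a sketch under simplifying assumptions that are not in the slides: start positions are uniform, each orientation is chosen with probability 1/2, and reads are error-free (no substitutions or indels). All names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_reads(genome, n_reads, read_len, seed=0):
    """Draw n_reads substrings of length read_len from random positions,
    each in forward or reverse-complement orientation."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            read = reverse_complement(read)
        reads.append(read)
    return reads


print(sample_reads("ACGTTGCAAGGCTTAC", n_reads=3, read_len=5))
```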

  • Whole Genome Assemblers
    TIGR Assembler: G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995)
    PHRAP: P. Green (1996)
    Celera Assembler
    CAP3: X. Huang, A. Madan, Genome Res 9, 868-877 (1999)
    RePS: J. Wang et al., Genome Res 12, 824-831 (2002)
    Phusion (Sanger): J.C. Mullikin, Z. Ning, Genome Res 13, 81-90 (2003)
    Arachne (Whitehead/MIT)
    Euler (UCSD, USC): P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001)

    Most assemblers follow the same approach:

    overlap - layout - consensus

  • CAP3 Assembler
    Removal of poor end regions of reads
    Computation of overlaps between reads
    Removal of false overlaps
    Construction of contigs
    Construction of multiple sequence alignments and generation of consensus sequences

  • CAP3: Clipping of Low-Quality Regions. Use base quality values (from Phred) and sequence similarities to compute the 5' and 3' clipping positions of reads.
    Definition of good regions of a read:
    - any sufficiently long region of high quality values that is similar to a region of another read, OR
    - any sufficiently long region that is highly similar to a good high-quality region of another read.
    Figure: computation of the 5' and 3' clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read. Huang, Madan, Genome Res 9, 868 (1999)
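
    The quality half of this idea can be sketched as follows. This is a deliberate simplification and not CAP3's procedure: it ignores the similarity criterion entirely and picks the segment that maximizes the sum of (quality - cutoff), a maximum-sum-segment heuristic. The name clip_by_quality and the default cutoff of 20 are assumptions.

```python
def clip_by_quality(quals, cutoff=20):
    """Return (start, end), a half-open interval of the read with the
    largest sum of (q - cutoff); (0, 0) if no base exceeds the cutoff."""
    best_sum, best = 0, (0, 0)
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        run_sum += q - cutoff
        if run_sum <= 0:
            # The running segment is no longer worth extending; restart.
            run_sum, run_start = 0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


quals = [5, 8, 30, 35, 40, 10, 38, 6, 4]
start, end = clip_by_quality(quals)
print(start, end)  # 5' clip before index `start`, 3' clip from index `end`
```

    A single low-quality base inside a good stretch (quality 10 in the example) does not split the region, because the surrounding high values outweigh it.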

  • Celera compartmentalized shotgun assembler: uses preliminary data from both human genome assembly projects. Huson et al. Bioinformatics 17, S132 (2001)

  • Arachne: program by Serafim Batzoglou (MIT, PhD thesis 2000).
    Create a graph G of overlaps between pairs of reads of shotgun data.
    Process G for the purpose of constructing supercontigs of mapped reads.
    Batzoglou et al. Genome Res 12, 177 (2002)

  • Earmuff links. An important variation of whole-genome shotgun sequencing obtains reads from both ends of an insert, forward and backward. Since inserts are size-selected, the approximate distance between the pair of reads obtained from the ends of a fragment is known.

    These will be called earmuff links.

  • Arachne: creation of overlap graph. List of reads R = (r1, ..., rN), where N is the number of reads. Each read ri has length li < 1000. If two reads ri and rj are taken from the endpoints of the same clone (earmuff link), ri has a link to rj at a specified distance dij.

    First: create a graph G of overlaps (edges) between pairs of reads (nodes). Pairs of reads in R need to be aligned. Since R can be very large, computing all N^2 pairwise alignments is infeasible.

    Create a table of occurrences of k-mers (strings of length k) in the reads, and count the number of k-mer matches for each pair of reads.

    Then perform pairwise alignments only between pairs of reads that contain more than a cutoff number of common k-mers. Batzoglou PhD thesis (2002)

  • Arachne: table of k-mer occurrences. Find the number of k-mer matches in the forward or reverse complement direction between each pair of reads in R.
    (1) Obtain all triplets (r, t, v), where r = read in R, t = index of a k-mer occurring in r, v = direction of occurrence (forward or reverse complement).
    (2) Sort the set of triplets according to the k-mer indices t.
    (3) Use the sorted list to create a table T of quadruplets (ri, rj, f, v), where ri and rj are reads that contain at least one common k-mer, v is a direction, and f is the number of k-mers in common between ri and rj in direction v.
    Batzoglou PhD thesis (2002)
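
    The three steps can be sketched in Python. This is an in-memory illustration, not Arachne's implementation, and it makes several assumptions: the k-mer string itself stands in for the index t, f counts distinct shared k-mers, ri is always taken as given (with ri < rj), and v records the orientation of rj ("+" forward, "-" reverse complement). All function names are hypothetical.

```python
from collections import Counter
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_triplets(reads, k):
    """Step 1: all (read index, k-mer, direction) triplets."""
    triplets = []
    for r, seq in enumerate(reads):
        for v, s in (("+", seq), ("-", reverse_complement(seq))):
            for t in {s[i:i + k] for i in range(len(s) - k + 1)}:
                triplets.append((r, t, v))
    return triplets


def shared_kmer_table(reads, k):
    """Steps 2 and 3: sort by k-mer, then count shared k-mers per pair."""
    triplets = sorted(kmer_triplets(reads, k), key=lambda trip: trip[1])
    counts = Counter()
    for _kmer, group in groupby(triplets, key=lambda trip: trip[1]):
        group = list(group)
        for ri, _, vi in group:
            if vi != "+":          # ri is always used in its given direction
                continue
            for rj, _, vj in group:
                if rj > ri:
                    counts[(ri, rj, vj)] += 1
    return [(ri, rj, f, v) for (ri, rj, v), f in sorted(counts.items())]


reads = ["GATTACA", "TTACAGG", "TGTAATC"]
for row in shared_kmer_table(reads, k=3):
    print(row)
```

    In the example, the third read is the reverse complement of the first, so all five of its 3-mers match in the "-" direction.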

  • Arachne: table of k-mer occurrences (figure: example with k = 3). Batzoglou PhD thesis (2002)

  • Arachne: table of k-mer occurrences. If a k-mer occurs too often, it is likely part of a repeat sequence, and we should not use it for detecting overlaps.

    Implementation:
    Find the k-mer occurrences (r, t, v) and sort them into 64 files according to the first three nucleotides of each k-mer.
    For i = 1, ..., 64: load file i into memory, sort it according to t, and store the sorted file.
    Load the 64 sorted files into memory sequentially and create table T incrementally.

    In practice, k = 8 to 24. Batzoglou PhD thesis (2002)
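
    The bucketing scheme and the repeat filter can be sketched together. For brevity the 64 "files" are in-memory lists here, and the max_occurrences parameter stands in for the frequency cutoff, whose actual value the slides do not give. Function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bucket_index(kmer):
    """Map the first three nucleotides of a k-mer to a bucket in 0..63."""
    a, b, c = (BASE_CODE[x] for x in kmer[:3])
    return 16 * a + 4 * b + c


def sorted_occurrences(triplets, max_occurrences):
    """Yield (r, t, v) triplets sorted by k-mer t, skipping every k-mer
    that occurs more than max_occurrences times (likely a repeat)."""
    buckets = [[] for _ in range(64)]
    for trip in triplets:
        buckets[bucket_index(trip[1])].append(trip)
    # Buckets are visited in lexicographic order of their 3-base prefix,
    # so sorting each bucket on its own gives a globally sorted stream.
    for bucket in buckets:
        bucket.sort(key=lambda trip: trip[1])
        i = 0
        while i < len(bucket):
            j = i
            while j < len(bucket) and bucket[j][1] == bucket[i][1]:
                j += 1
            if j - i <= max_occurrences:
                yield from bucket[i:j]
            i = j


triplets = [(0, "TTTA", "+"), (1, "ACGT", "+"), (2, "ACGT", "-"),
            (3, "ACGT", "+"), (0, "AAAC", "+"), (1, "AAAC", "-")]
print(list(sorted_occurrences(triplets, max_occurrences=2)))
```

    Only one bucket has to be sorted at a time, which is the point of splitting the data when the full list does not fit in memory.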

  • Arachne: pairwise read alignments. Perform pairwise alignments between reads that contain more than a cutoff number of common k-mers. When k-mers that are too common (occurring more often than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise alignments will be performed.

    Only a small number of base substitutions and indels is allowed in an overlapping region of two aligned reads. Use a dynamic programming alignment that disallows deviations of more than a few characters.
    Output of the alignment algorithm: for reads ri, rj, quadruplets (b1, b2, e1, e2) of the beginning (b1, b2) and end (e1, e2) positions of the detected overlap region.
    If a significant overlap region is detected, (ri, rj, b1, b2, e1, e2) becomes a link in the overlap graph G. Batzoglou PhD thesis (2002)
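
    A dynamic program that "disallows deviations of more than a few characters" is commonly realized as a banded alignment. The sketch below computes a banded edit distance between two already-extracted overlap regions. Unit costs for substitutions and indels are an assumption, the function name is hypothetical, and locating the overlap boundaries (b1, b2, e1, e2) is not covered.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b over alignments that stay within
    `band` cells of the main diagonal; None if no such alignment exists."""
    inf = float("inf")
    if abs(len(a) - len(b)) > band:
        return None
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = {j: j for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1       # gap in b
            insert = cur.get(j - 1, inf) + 1    # gap in a
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[len(b)]


print(banded_edit_distance("ACGTACGT", "ACGTTCGT", band=2))
print(banded_edit_distance("ACGTACGT", "ACGACGT", band=2))
```

    Each row touches at most 2 * band + 1 cells, so the cost is linear in the read length for a fixed band rather than quadratic.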

  • Correcting errors in reads. Shown is a portion of a multiple alignment between 5 reads. A base T of quality 30 is aligned to bases C, some of which are of quality greater than 30. The base T is subsequently changed to a base C of quality 30. Batzoglou et al. Genome Res 12, 177 (2002)
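
    The example can be mimicked with a toy rule for a single alignment column. The decision rule used here (the base with the largest summed quality wins, and each corrected base keeps its own quality) is an assumption chosen to reproduce the slide's example; the slides do not state Arachne's actual criterion.

```python
from collections import defaultdict


def correct_column(column):
    """column: list of (base, quality) pairs, one per aligned read.
    Every base is set to the base with the largest total quality;
    qualities are left unchanged."""
    support = defaultdict(int)
    for base, qual in column:
        support[base] += qual
    winner = max(support, key=support.get)
    return [(winner, qual) for _base, qual in column]


column = [("C", 40), ("C", 35), ("T", 30), ("C", 20), ("C", 45)]
print(correct_column(column))
```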

  • Partial alignments. 3 partial alignments of length k=6 between a pair of reads coalesce to yield a single full alignment of length k=19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads. Batzoglou et al. Genome Res 12, 177 (2002)
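
    One way to picture the coalescing step is to merge k-mer hits that lie on the same alignment diagonal and are separated by at most a few mismatching positions. This is a sketch under that assumption; the max_gap parameter and the function name are hypothetical, and the slides do not spell out Arachne's merging rule.

```python
from collections import defaultdict


def coalesce_hits(hits, k, max_gap=1):
    """hits: iterable of (i, j), start positions of a shared k-mer in
    read a and read b. Returns merged segments (i_start, j_start, length)."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[i - j].append(i)
    merged = []
    for diag in sorted(by_diagonal):
        starts = sorted(by_diagonal[diag])
        seg_start, seg_end = starts[0], starts[0] + k
        for i in starts[1:]:
            if i <= seg_end + max_gap:
                # Overlapping, adjacent, or separated by a short gap.
                seg_end = max(seg_end, i + k)
            else:
                merged.append((seg_start, seg_start - diag, seg_end - seg_start))
                seg_start, seg_end = i, i + k
        merged.append((seg_start, seg_start - diag, seg_end - seg_start))
    return merged


# Three 6-mer hits on one diagonal, with one mismatch before the third.
print(coalesce_hits([(0, 3), (6, 9), (13, 16)], k=6))
```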

  • Ambiguity created by the presence of repeats. In the absence of sequencing errors and repeats it would be simple to retrieve all retrievable pairwise distances of reads and to construct G.

    In the presence of repeats, a link between two reads in G does not necessarily imply true overlap. A repeat link is a link in G between two reads that come from different regions in the genome and overlap in a repeated segment. Batzoglou PhD thesis (2002)

  • Arachne: processing of overlap graph. Some of the repetition in the genome is efficiently masked before the creation of G by throwing away k-mers of high frequency when building T.

    Furthermore some heuristic algorithms are used to de