De novo repeat classification and fragment assembly:
from de Bruijn to A-Bruijn graphs
IS30 repeat family
Repeats in Bacterial Genome: IS30 repeat family in N. meningitidis
Edge label: 130(2) means length 130bp and multiplicity 2
De novo repeat classification and fragment assembly with A-Bruijn
graph approach
Pavel Pevzner1, Haixu Tang2, Glenn Tesler3
1Department of Computer Science and Engineering, UCSD
2Department of Computer Science, University of Indiana
3Department of Mathematics, UCSD
Adapted from http://www.hhmi.org/research/investigators/eichler.html
Duplication landscape of 2p11
Mosaic Arrangements in Segmental Duplications in Human Genome
Two-step model of segmental duplications (Evan Eichler, 1997)
Adapted from Horvath et al 2005
Ancestral duplication units are called duplicons
Mosaic repeats structure: an imaginary example
A B C D E F G H I J
A B C D E F G H I J C
A B C D E F G H I J C B C D
A B C D E F G H I J C B C D F G C
• The mosaic structure of segmental duplications in the human genome is revealed using the A-Bruijn graph approach: Jiang, Tang, She, P.P., Eichler. Evolutionary reconstruction of segmental duplications reveals punctuated cores of human gene innovations (Nature Genetics, 2007)
The Segmental Duplication in Human Genome: The Ancestral Duplicon Problem
• The duplicon problem: can we identify duplicons?
• Case-by-case studies:
* Jackson et al 1999; …, Stankiewicz et al 2004; Horvath et al 2005
• All duplicons: hard
* The Eichler group had only limited success with chr22 (Bailey et al 2002)
• The mosaic structure of segmental duplications in human genome is revealed using the A-Bruijn graph approach:
Zhaoshi Jiang, Haixu Tang, Xinwei She, PP, Evan Eichler. Evolutionary reconstruction of segmental duplications reveals punctuated cores of human gene innovations (Nature Genetics, in press)
The Duplicon Problem
• Tang et al 2007 proposed a computational method for identifying all duplicons.
Adapted from Tang et al 2005
Repeat Classification
• Repeat representation as a mosaic of sub-repeats
• Detailed study of each sub-repeat and its further classification into repeat sub-families
Mosaic Structure of Segmental Duplication in Human Chromosome 22
Bailey et al. 2002, Am. J. Hum. Genet.
Algorithmic Challenge
• Problem: find all repeat elements and reveal the sub-repeat mosaic structure.
– Perfect repeats: de Bruijn graph, suffix tree.
– Imperfect repeats: OPEN PROBLEM.
• Goal: Generalize the de Bruijn graph for imperfect repeats.
De Novo Repeat Classification: messy and ill-defined problem
“The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack.”
Bao & Eddy 2002, Genome Research
De Novo Repeat Classification
Library of repeat elements
Element 1 AGCCTACG … …
Element 2 TGCATTTT … …
Element 3 GAACTCAC … …
… …
De novo compilation
?
Reputer
Pairwise similarity
Similarity matrix
A B C D E F G H I JC B C D F GC
De novo Repeat Classification: Previous Studies
1. RepeatFinder: Volfovsky et al., 2001
2. RECON: Bao & Eddy, 2002
– Heuristic algorithms that work well for real data
– Do not reveal the mosaic structure
Annotating Known Repeat Elements
Library of (known) repeat elements
Element 1 AGCCTACG … …
Element 2 TGCATTTT … …
Element 3 GAACTCAC … …
… …
De novo Repeat Classification Problem: Given a newly sequenced genome, find all repeat elements (sub-repeats) in this genome.
Genome
RepeatMasker
Mosaic Structure of Repeats: A Real Example (Clone G12 from Human Chromosome Y)
Segment        1   2   3    4    5   6    7   8   9  10   11  12  13  14     15
Length      8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
RECON (Bao and Eddy, 2002) does not reveal the mosaic:
A-Bruijn representation 2 copies
3 copies
2 copies
4 copies
?
De Bruijn Graph: Applications in Bioinformatics
• Sequencing by hybridization (Pevzner, 1989)
• Re-sequencing with DNA arrays (Shamir and Tsur, 2001; Peer et al., 2002)
• Fragment assembly (Idury and Waterman, 1995; Pevzner et al., 2001)
• EST analysis (Heber et al., 2002)
• Computational mass-spectrometry (Bocker, 2003; Bandeira et al., 2007)
De Bruijn Graph: Classification of Perfect Repeats
Sequence: ABCDEFCGHBCDIFCGJ
Vertices: (k-1)-mers from the sequence
AB BC CD DE EF FC CG GH HB DI IF GJ
Edges: k-mers from the sequence
Every sub-repeat is represented as a repeat edge in the graph.
Repeat edges: BCD FCG
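The construction on this slide can be sketched in a few lines. This is a minimal illustration, assuming k = 3 and treating each letter of the slide's toy sequence as one symbol; the function names `de_bruijn_edges` and `repeat_edges` are hypothetical and not taken from the talk.

```python
from collections import Counter

def de_bruijn_edges(seq, k):
    """Count every k-mer of seq; each k-mer is an edge joining two (k-1)-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeat_edges(seq, k):
    """Return the k-mers that occur more than once (edges of multiplicity > 1)."""
    return {kmer: n for kmer, n in de_bruijn_edges(seq, k).items() if n > 1}

seq = "ABCDEFCGHBCDIFCGJ"   # toy sequence from the slide
k = 3                        # assumed value; vertices are then 2-mers
edges = de_bruijn_edges(seq, k)
# A vertex is the prefix or the suffix of some edge.
vertices = {kmer[:-1] for kmer in edges} | {kmer[1:] for kmer in edges}
repeats = repeat_edges(seq, k)

print(sorted(vertices))
print(repeats)
```

With these assumptions the repeated edges are the two sub-repeats named on the slide. The approach only works for perfect repeats, since a single mismatch changes the k-mer.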
Repeat Gluing
[Figure: gluing of repeats in a sequence with segments labeled x and y, driven by a gluing instruction]
Similarity matrix
A B C D E F G H I JC B C D F GC
[Figure: repeat graph of the sequence above. Sub-repeats are edges in the repeat graph: B, D, F and G have 2 copies each, and C has 4 copies.]
In reality, repeats are usually imperfect
8328 140 628 1185 2905 381 161442628 1185 140 628 1185 381 140 628
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
… … AG-CCATCGACGTCACC … …
… … AGTGCCTCG-CGTCTCC … …
Similarity matrix
A B C D E F G H I JC B C D F GC
Inconsistent Gluing
[Figure: an inconsistent gluing of segments x and y]
Consistent Gluing
[Figure: a consistent gluing of segments x and y]
Challenge: Generalize the Notion of De Bruijn Graph for Imperfect Repeats
• Input
– a genomic sequence
– all significant local pairwise alignments
• Output
– repeat graph representing all repeats as a mosaic of sub-repeats
A-Bruijn Graph Construction
Genomic sequence: … acat … acgt … ccat …
Pairwise local alignments:
1  acat
   acgt
2  acgt
   ccat
3  acat
   ccat
A-graph (aligned copies): a c a t, a c g t, c c a t
A-Bruijn graph (glued vertices): a,c → c → a,g → t
[Figure: similarity matrix of the genomic sequence against itself]
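The gluing step can be sketched with a union-find structure: every position of the sequence starts as its own vertex, and each aligned pair of positions is merged. This is only an illustration under simplifying assumptions: the three copies are concatenated directly, the alignments are ungapped and given as explicit position pairs, and the names `DisjointSet` and `a_bruijn_graph` are hypothetical.

```python
from collections import Counter

class DisjointSet:
    """Union-find over positions 0..n-1; the root of a class is its smallest position."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)

def a_bruijn_graph(seq, aligned_pairs):
    """Glue aligned positions; return vertex labels and edge multiplicities."""
    ds = DisjointSet(len(seq))
    for i, j in aligned_pairs:
        ds.union(i, j)
    labels = {}
    for pos, ch in enumerate(seq):
        labels.setdefault(ds.find(pos), set()).add(ch)
    # Consecutive positions of the sequence become edges between glued vertices.
    edges = Counter((ds.find(p), ds.find(p + 1)) for p in range(len(seq) - 1))
    return labels, edges

seq = "acat" + "acgt" + "ccat"          # the three copies from the slide
starts = [(0, 4), (4, 8), (0, 8)]       # alignments 1, 2, 3 as copy start offsets
pairs = [(a + d, b + d) for a, b in starts for d in range(4)]
labels, edges = a_bruijn_graph(seq, pairs)
print(labels)
print(edges)
```

The four glued vertices carry the letter sets shown on the slide. Gapped alignments would simply contribute fewer position pairs, which is where bulges and whirls come from.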
A-Bruijn Graph Construction: Bulges
Genomic sequence: … at … act … acat …
Consistent pairwise local alignments:
1  a-t
   act
2  ac-t
   acat
3  a--t
   acat
A-graph (aligned copies): a t, a c t, a c a t
A-Bruijn graph (vertex labels): a, c, a, t, forming a bulge
[Figure: similarity matrix of the genomic sequence against itself]
A-Bruijn Graph Construction: Whirls
Genomic sequence: … at … act … acat …
Inconsistent pairwise local alignments:
1  a-t
   act
2  ac-t
   acat
3  --at
   acat
A-graph (aligned copies): a t, a c t, a c a t
A-Bruijn graph (vertex labels): a, c, t, forming a whirl
[Figure: similarity matrix of the genomic sequence against itself]
8328 140 628 1185 2905 381 161442628 1185 140 628 1185 381 140 628
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Repeat Graph
[Figure: an A-Bruijn graph and the repeat graph obtained from it]
Problem: Simplifying A-Bruijn Graph
Removing Bulges and Whirls: Solving MSLG Problem
Maximum Subgraph with Large Girth (MSLG) Problem:
Input: a graph;
Output: a maximum weight subgraph that does not contain short cycles, i.e. cycles of length less than a parameter girth.
Solution known only when the girth is infinite -- Maximum Spanning Tree Problem (maximum weight acyclic subgraph).
NP-hard problem (Skiena, 2002).
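To make the girth condition concrete, the sketch below computes the length of the shortest cycle of a graph by running a breadth-first search from every vertex. It assumes a simple undirected graph given as an adjacency dictionary, which is a simplification of the A-Bruijn graph; the function name `girth` is illustrative only.

```python
from collections import deque

def girth(adj):
    """Shortest cycle length of a simple undirected graph; inf if it is acyclic."""
    best = float("inf")
    for source in adj:
        dist = {source: 0}
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:
                    # A non-tree edge closes a cycle through the BFS tree.
                    best = min(best, dist[u] + dist[v] + 1)
    return best

triangle_with_tail = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(girth(triangle_with_tail), girth(square), girth(path))
```

A subgraph is feasible for MSLG when this value is at least the girth parameter. Checking feasibility is easy; choosing which edges to drop so that the remaining weight is maximal is the hard part.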
Minimum (or Maximum) Spanning Trees
• The first algorithm for finding an MST was developed by Borůvka in 1926 to minimize the cost of electrical coverage in Moravia.
• The MST Problem
– Connect all of the cities using the least amount of wire possible
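The spanning tree problem above has simple greedy solutions. The sketch below uses Kruskal's algorithm (not Borůvka's original method) with edges sorted by decreasing weight, which gives a maximum rather than a minimum spanning tree. The edge format `(weight, u, v)` and the function name are assumptions made for this example.

```python
def maximum_spanning_tree(n, edges):
    """Kruskal's algorithm on vertices 0..n-1; edges are (weight, u, v) tuples.

    Returns the kept edges, heaviest first.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:            # the edge joins two components, so no cycle is closed
            parent[ru] = rv
            kept.append((w, u, v))
    return kept

toy_edges = [(5, 0, 1), (3, 1, 2), (4, 0, 2), (1, 2, 3)]
tree = maximum_spanning_tree(4, toy_edges)
print(tree, sum(w for w, _, _ in tree))
```

The result is acyclic, so it has infinite girth; every edge it discards is one that would have closed a cycle among heavier edges.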
Solution: Maximum Spanning Tree Approximation to MSLG Problem
RepeatGluer Algorithm
[Figure: an A-Bruijn graph with vertices a through j, shown through the stages of bulge and whirl removal, zigzag path straightening, and erosion]
IS30 repeat family
Repeats in Bacterial Genome: IS30 repeat family in N. meningitidis
Edge label: 130(2) means length 130bp and multiplicity 2
The Repeats Graph of N. meningitidis
Repeats in Human Genome (ALU)
Edge label: 17(80-127) means length 17bp and multiplicity between 80 and 127
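The edge label notation used on these slides is easy to read mechanically. The helper below is a small sketch, assuming labels always have the form `length(multiplicity)` or `length(low-high)`; `parse_edge_label` is a hypothetical name, not part of any tool mentioned in the talk.

```python
import re

def parse_edge_label(label):
    """Parse '130(2)' or '17(80-127)' into (length_bp, min_multiplicity, max_multiplicity)."""
    match = re.fullmatch(r"(\d+)\((\d+)(?:-(\d+))?\)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized edge label: {label!r}")
    length, low, high = match.groups()
    low = int(low)
    # A single multiplicity is treated as a range with equal ends.
    return int(length), low, int(high) if high is not None else low

print(parse_edge_label("130(2)"))
print(parse_edge_label("17(80-127)"))
```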
Multiple Alignment = Finding Repeats in a Concatenation of All Sequences
Raphael et al., 2004 (Genome Research)
Fragment Assembly in Genome Sequencing
• Genomes are very long (the human genome has 3 billion base pairs)
• Current technology can only reliably “read” a short DNA fragment (a read, typically 500 – 1000 base pairs)
• Whole genome shotgun sequencing: break the genome into millions of overlapping pieces (reads) and sequence each read
• Fragment assembly: assembly of the genome from millions of overlapping reads
• Celera Genomics has assembled the human genome (2001)
• Over 100 large genomes are waiting for assembly
• Current assembly programs may produce misassemblies
– Recent mammalian genome assemblies, e.g. mouse and rat, are estimated to have thousands of misassemblies
• Repeats in the genomic sequence are the main cause of misassembly
Fragment Assembly Using Repeat Graph
Reads
Genome: A B C D E F G H I J C B C D F G C
A B C D I F G H E J C B C D F G C
[Figure: repeat graph of the genome, with edges A through J]
Every possible genome reconstruction corresponds to an Eulerian path in the repeat graph.
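An Eulerian path in a directed multigraph can be found with Hierholzer's algorithm. The sketch below applies it to a hand-built repeat graph for the toy genome A B C D E F G H I J C B C D F G C; the vertex names (`s`, `u`, `v`, `w`, `x`, `h1`, `h2`) and the edge list are assumptions made for illustration, not data from the talk, and the code assumes that an Eulerian path exists.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm; edges are (u, v, label). Returns labels in path order."""
    out = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for u, v, label in edges:
        out[u].append((v, label))
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), edges[0][0])
    stack = [(start, None)]
    path = []
    while stack:
        node, label = stack[-1]
        if out[node]:
            nxt, lab = out[node].pop()
            stack.append((nxt, lab))
        else:
            stack.pop()
            if label is not None:
                path.append(label)
    return path[::-1]

def toy_repeat_graph():
    """Hypothetical repeat graph: B, D, F, G have 2 copies each and C has 4."""
    single = [("s", "u", "A"), ("v", "v", "E"),
              ("w", "h1", "H"), ("h1", "h2", "I"), ("h2", "w", "J")]
    repeated = [("u", "w", "B", 2), ("w", "u", "C", 4), ("u", "v", "D", 2),
                ("v", "x", "F", 2), ("x", "w", "G", 2)]
    edges = list(single)
    for u, v, label, copies in repeated:
        edges += [(u, v, label)] * copies
    return edges

print(" ".join(eulerian_path(toy_repeat_graph())))
```

Several different Eulerian paths usually exist in such a graph, which is exactly why repeats make the reconstruction ambiguous without extra information such as mate-pairs.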
Fragment Assembly = Building Repeat Graph from Concatenated Reads
Key idea: The repeat graph built from concatenated reads is identical to the repeat graph built from genomic sequence if the reads “cover” the genomic sequence.
Similarity matrix
[Figure: 0/1 similarity matrix of the concatenated reads]
Fragment Assembly: Building Repeat Graph from Reads
[Figure: snapshots of the similarity matrix for reads a, b and c, and the resulting repeat graph with segments x and y]
FragmentGluer Algorithm (outline)
• Concatenate reads (in an arbitrary order!) into a single sequence
• Compute the similarity matrix for this concatenated sequence (overlap detection between reads)
• Use this matrix as a “glue” to build the repeat graph with the RepeatGluer algorithm
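The outline above can be illustrated on error-free reads. In this sketch exact shared k-mers stand in for the pairwise alignments, which is a strong simplification of real overlap detection; the function names and the toy genome are hypothetical.

```python
def concatenate(reads):
    """Join reads into one string and record the start offset of each read."""
    offsets, pos = [], 0
    for read in reads:
        offsets.append(pos)
        pos += len(read)
    return "".join(reads), offsets

def similarity_pairs(reads, k):
    """Sparse similarity matrix: pairs (i, j), i < j, of positions in the
    concatenation whose k-mers are identical and lie inside a single read.
    Each later occurrence is paired with the first occurrence of its k-mer."""
    concat, offsets = concatenate(reads)
    first_seen = {}
    pairs = []
    for read, offset in zip(reads, offsets):
        for s in range(len(read) - k + 1):
            kmer = read[s:s + k]
            if kmer in first_seen:
                pairs.append((first_seen[kmer], offset + s))
            else:
                first_seen[kmer] = offset + s
    return concat, pairs

genome = "ATGCGTACGTTAG"                       # toy genome, assumed error-free
reads = [genome[4:11], genome[8:13], genome[0:7]]   # arbitrary order
k = 3
concat, pairs = similarity_pairs(reads, k)
kmer_starts = sum(len(r) - k + 1 for r in reads)
glued_classes = kmer_starts - len(pairs)       # each pair merges one occurrence
print(glued_classes, len({genome[i:i + k] for i in range(len(genome) - k + 1)}))
```

Because the reads cover the genome, gluing the paired positions leaves as many classes as the genome has distinct k-mers, regardless of the order in which the reads were concatenated.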
FragmentGluer algorithm
• Identify and remove chimeric reads;
• Concatenate the remaining reads into a sequence and compose similarity matrix A from the pairwise alignments of reads;
• Construct the A-Bruijn graph from the similarity matrix A;
• Remove bulges and whirls;
• Thread each read through the resulting graph and form the consensus sequence from reads;
• Define the coverage of a vertex in the graph as the number of reads that are threaded through this vertex;
• Define coverage of simple paths as average coverage of their vertices;
FragmentGluer algorithm (cont’d)
• Form the repeat graph by collapsing simple paths in the graph;
• The consensus sequence of an edge in the repeat graph is defined as the consensus sequence of the corresponding simple path;
• Output repeat families as tangles in the repeat graph;
• Every tangle is a collection of edges (sub-repeats) with corresponding consensus sequences;
• Transform mate-pairs into mate-paths in the graph obtained and perform equivalent transformations on the resulting set of mate-paths;
• Use mate-pairs to resolve differences between nearly identical copies of repeats;
• Define contigs as consensus sequences of simple paths in the resulting graph;
• Assemble the resulting contigs into scaffolds by the EULER Scaffolding algorithm
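Two of the steps listed above, collapsing simple paths and averaging vertex coverage along a path, can be sketched as follows. The graph is assumed to be a plain directed graph given as a list of `(u, v)` edges, and isolated cycles made only of internal vertices are ignored; the function names are illustrative.

```python
from collections import defaultdict

def collapse_simple_paths(edges):
    """Return maximal non-branching paths as lists of vertices.

    A vertex is internal to a path when it has exactly one incoming and
    one outgoing edge.
    """
    out = defaultdict(list)
    indeg = defaultdict(int)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    nodes = set(out) | set(indeg)

    def internal(n):
        return indeg[n] == 1 and len(out[n]) == 1

    paths = []
    for n in sorted(nodes):
        if internal(n):
            continue                      # paths start only at branching or end vertices
        for v in out[n]:
            path = [n, v]
            while internal(v):
                v = out[v][0]
                path.append(v)
            paths.append(path)
    return paths

def path_coverage(path, vertex_coverage):
    """Average number of reads threaded through the vertices of a path."""
    return sum(vertex_coverage[v] for v in path) / len(path)

toy_edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
toy_paths = collapse_simple_paths(toy_edges)
print(toy_paths)
print(path_coverage(toy_paths[0], {0: 4, 1: 6, 2: 8}))
```

Each returned path becomes one edge of the repeat graph, and its consensus sequence and coverage are attached to that edge.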
Benchmarking EULER+, Phrap and Arachne on all BACs from Human Chromosome 20
• EULER+ produced the fewest misassembled contigs:
Misassembled contigs by Phrap: 37
Misassembled contigs by ARACHNE: 17
Misassembled contigs by EULER+: 7
• EULER+ also had the fewest collapsed repeat copies (4), ahead of Phrap (5) and Arachne (9).
• The average number of contigs per BAC was lowest for EULER+ (6.2), followed by Phrap (6.8) and ARACHNE (13.8).
Pevzner et al., 2004 (Genome Research)
Comparison of Euler and Newbler Assemblies
Genome         Assembler     No. contigs  N50     Misassembled  Coverage  Net size
E. coli        Euler         199          46887   3             94.7      4277
               Newbler       141          60757   0             99.1      4531
               Repeat graph  94           125693  -             92.1      4560
S. pneumoniae  Euler         127          32619   1             96.8      2001
               Newbler       253          11905   0             95.0      2000
               Repeat graph  136          36004   -             97.5      2091
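The N50 column is not defined on the slide. The sketch below assumes the usual definition: the largest contig length L such that contigs of length at least L together contain at least half of the total assembled bases. The function name `n50` and the sample lengths are illustrative and are not the contig lengths behind the table.

```python
def n50(lengths):
    """N50 of a list of contig lengths; 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # at least half of all bases are covered
            return length
    return 0

print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Under this definition a higher N50 means that the assembly is concentrated in longer contigs, which is why it is reported next to the contig count.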
Comparison of Euler and Newbler Assemblies (without flowgrams)
Genome         Assembler     No. contigs  N50     Misassembled  Coverage  Net size
E. coli        Euler         199          46887   3             94.7      4277
               Newbler       311          28475   0             99.1      4531
               Repeat graph  94           125693  -             92.1      4560
S. pneumoniae  Euler         127          32619   1             96.8      2001
               Newbler       253          11905   0             95.0      2000
               Repeat graph  136          36004   -             97.5      2091
454 Reads + Sanger Reads
[Figure: plot against Sanger coverage (x-axis, 1x to 5x) with a y-axis from 0 to 50,000. Series: S. pneumoniae; S. pneumoniae, 3kb mates; S. pneumoniae 30x 454]
Acknowledgements
• Evan Eichler, Genome Sciences, University of Washington
A B C D E F G H I J C B C D F G C
[Figure: repeat graph of the sequence above. Sub-repeats are tangle edges: B, D, F and G have 2 copies each, and C has 4 copies.]
[Figure: the 15 segments of the clone threaded through the repeat graph. Segments 2, 9, 13 share the edge of length 140 (3 copies); segments 3, 6, 10, 14 share the edge of length 628 (4 copies); segments 4, 7, 11 share the edge of length 1185 (3 copies); segments 8, 12 share the edge of length 381 (2 copies); segments 1 (8328), 5 (2905) and 15 (161442) are unique.]
Sub-repeats: tangle edges
Mosaic structure of repeats: a real example (BAC G12 from human chromosome Y)
8328 140 628 1185 2905 381 161442628 1185 140 628 1185 381 140 628
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Mosaic structure of repeats
A B C D E F G H I J C B C D F G C
Consensus repeat elements
Repeat boundary problem (Bao & Eddy, 2002)
Mosaic representation: four sub-repeats with 2 copies each and one sub-repeat with 4 copies
A-Bruijn graph approach: an overview
All local alignments (genomic dot plot)
A-Bruijn Graph
Repeat Graph
Removing bulges and whirls
Gluing
Bacterial genomes assembly
All 5 bacterial genomes were mis-assembled by Phrap (either repeat collapsing or joining non-contiguous regions or both). ARACHNE and EULER+ make no assembly errors except for one genome.