Scaffolding Draft Genomes Using Paired Sequencing Data
James Lindsay, Jin Zhang, Thomas Farnham, Yufeng Wu, Rachel O’Neill, Ion Mandoiu (University of Connecticut)
Edward Bullwinkel, Hamed Salooti, Alex Zelikovsky (Georgia State University)
Scaffolding Problem
Overview
• Draft genomes often consist of many contigs
• Ordering and orienting the contigs relative to each other makes genomes more useful
• This is accomplished using paired-end or mate-pair reads that map to different contigs
• When the two reads of a pair map to different contigs, they induce an order and orientation between those contigs
• Existing tools were designed to work with Sanger reads (~1000 bp)
• Long reads ensure accurate mapping
• Next-gen reads are comparatively short, and mapping errors are frequent
• This work presents a method to scaffold draft genomes using paired next-gen reads
Legend: blue arrows: paired-end reads; red, green, orange lines: contigs; red, green, orange arrows: oriented contigs
Mapping and Filtering
• Read Mapping
– Use existing short-read alignment tools to map the reads onto contigs
– The tool must be able to report multiple hits for each read and generate SAM output
• Read Filtering
– Remove pairs that include reads that do not map uniquely
  • “Uniqueness” depends on mapping parameters
– Remove pairs that include reads that map within repeats
  • Repeats are annotated by RepeatMasker/RepeatModeler
– Remove pairs with reads mapped to different contigs if the insert size implied by the mapping is unlikely
  • A read pair is removed if the minimum insert size implied by the mapping exceeds the expected insert size by more than 3 standard deviations
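The insert-size rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the minimum implied insert size is assumed to be the sum of each read's distance to the end of its contig that faces the other contig (that is, a gap of 0 bp between the contigs).

```python
def min_implied_insert(dist_to_end_a: int, dist_to_end_b: int) -> int:
    """Smallest insert size compatible with a pair spanning two contigs.

    Each argument is the distance (bp) from the outer end of a read to the
    end of its contig that faces the other contig. Assumption: the gap
    between the two contigs cannot be shorter than 0 bp.
    """
    return dist_to_end_a + dist_to_end_b


def keep_pair(dist_to_end_a: int, dist_to_end_b: int,
              mean_insert: float, sd_insert: float, n_sd: float = 3.0) -> bool:
    """Return True if the pair survives the insert-size filter."""
    threshold = mean_insert + n_sd * sd_insert
    return min_implied_insert(dist_to_end_a, dist_to_end_b) <= threshold


# Hypothetical distances; library parameters taken from the figure legend
# (insert size = 1395, std dev = 250), which gives a threshold of 2145 bp.
print(keep_pair(2000, 500, mean_insert=1395, sd_insert=250))
print(keep_pair(1000, 500, mean_insert=1395, sd_insert=250))
```

The first pair implies an insert of at least 2500 bp and is removed; the second implies at least 1500 bp and is kept.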
[Figure: example of read filtering on contigs with an annotated repeat; scale labels 2000 bp and 500 bp]
Legend: black lines: contigs; colored arrows: reads; braces: annotations. The blue reads are filtered because of non-uniqueness, the green reads because of repeats, and the orange reads (assuming insert size = 1395, std dev = 250) because of minimum insert size.
Contig Graph and Orientation ILP
Contig graph
• Contigs are vertices
• Edges are defined using redundancy parameters r and d
– Read pairs between contigs i and j can be divided into two classes: those consistent with the contigs’ current orientation, and those consistent with switching the orientation of one contig
– Contigs i and j are connected if:
  • There are at least r pairs with reads mapped between i and j
  • The ratio between the sizes of the larger and smaller consistency classes exceeds d
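The edge rule can be illustrated with a short Python sketch. The function name and argument names are hypothetical, and two details are assumptions about how to read the rule: the r threshold is applied to the total number of pairs between the two contigs, and "exceeds d" is taken as a strict inequality.

```python
def is_edge(n_current: int, n_flipped: int, r: int, d: float) -> bool:
    """Decide whether contigs i and j are connected in the contig graph.

    n_current: pairs consistent with the contigs' current orientation
    n_flipped: pairs consistent with switching the orientation of one contig
    """
    # Redundancy: require at least r pairs between the two contigs.
    if n_current + n_flipped < r:
        return False
    larger = max(n_current, n_flipped)
    smaller = min(n_current, n_flipped)
    # larger / smaller > d, written without division so smaller == 0 is safe.
    return larger > d * smaller


print(is_edge(3, 0, r=2, d=2))  # enough pairs, all in one class
print(is_edge(3, 2, r=2, d=2))  # ratio 1.5 does not exceed d = 2
print(is_edge(1, 0, r=2, d=2))  # too few pairs
```

Under this reading, a tie between the two classes never produces an edge, even at d = 1.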
Integer Linear Program (ILP)
• It is typically not possible to find a contig orientation consistent with all read pairs
• An integer linear program (ILP) is used to find a contig orientation that minimizes the number of inconsistent pairs
• The 0/1 variable Si indicates the final orientation of contig i (Si = 1 iff contig i is flipped)
• hij and uij denote the number of read pairs between contigs i and j that are consistent, respectively inconsistent, with the contigs’ current orientation
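The objective can be made concrete on a toy instance. The sketch below is not the authors' ILP and uses no solver: it enumerates every orientation, which is feasible only for a handful of contigs. It assumes that flipping exactly one contig of a pair swaps the two consistency classes, so an edge costs uij when Si = Sj and hij otherwise. The function name and the example counts are hypothetical.

```python
from itertools import product


def best_orientation(n_contigs, edges):
    """Exhaustively minimize the number of inconsistent read pairs.

    edges maps a contig pair (i, j) to (h_ij, u_ij): the number of read
    pairs consistent / inconsistent with the current orientation.
    Returns (minimum cost, tuple of 0/1 flips S_0..S_{n-1}).
    """
    best_cost, best_flips = None, None
    for flips in product((0, 1), repeat=n_contigs):
        cost = 0
        for (i, j), (h, u) in edges.items():
            # If exactly one endpoint is flipped, the formerly consistent
            # pairs become the inconsistent ones, and vice versa.
            cost += h if flips[i] != flips[j] else u
        if best_cost is None or cost < best_cost:
            best_cost, best_flips = cost, flips
    return best_cost, best_flips


# Hypothetical counts for three contigs.
edges = {(0, 1): (5, 1), (1, 2): (0, 4), (0, 2): (1, 3)}
cost, flips = best_orientation(3, edges)
print(cost, flips)
```

Leaving all three contigs as they are violates 1 + 4 + 3 = 8 pairs, while flipping only contig 2 violates 1 + 0 + 1 = 2. Flipping every contig gives the same cost as flipping none, so optimal solutions always come in complementary pairs.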
Experimental Setup
Contig sets
• Developed a test set consisting of contigs from chr21 of the HuRef assembly, and contigs from a de novo assembly of a subset of HuRef Sanger reads (~4x average coverage)
• The true orientation of the assembled contigs was found by mapping them to the reference genome; we retained all contigs with at least 50% contiguous alignment
Read pair data
• 480 million 50 bp SOLiD mate pairs were mapped against the contigs using Bowtie
• Reads were mapped allowing 1 mismatch in the seed and a sum of Phred quality scores of mismatches < 80, reporting the 10 best alignments
• We compared the orientation step of our algorithm to the BAMBUS2 orientation tool
• Both filtered and unfiltered read pairs were given to BAMBUS2 and the ILP for orientation*
*BAMBUS2 has its own repeat filtering capabilities; to facilitate comparison, pairs given to BAMBUS2 were also filtered using our repeat annotations
Contig Graphs for Chr21 HuRef Contigs
[Figure panels: unfiltered, r=2, d=2; filtered, r=2, d=2]
• Visualization of contig graphs for Chr21 HuRef contigs generated using Graphviz
• Connected components consisting of fewer than 5 contigs are not displayed
• Red edges indicate at least one read pair inconsistent with the current contig orientation
• HuRef contigs represent an “ideal” dataset; contigs are long and contain few errors
• Filtering makes little difference in this case
Results for 5135 Chr21 Contigs from 4x Assembly
Algorithm  Pair filtering  # pairs  r  delta  Singletons  Correct orientation  Incorrect orientation
bambus2    No              573984   1  *      375         2824                 1936
bambus2    Yes             497907   1  *      450         2903                 1782
ILP        No              573984   1  1
ILP        No              573984   1  2
ILP        Yes             497907   1  1
ILP        Yes             497907   1  2
bambus2    No              573984   2  *      643         2696                 1796
bambus2    Yes             497907   2  *      678         2885                 1572
ILP        No              573984   2  1      561         2893                 1681
ILP        No              573984   2  2      561         2855                 1719
ILP        Yes             497907   2  1      614         3522                 999
ILP        Yes             497907   2  2      614         3469                 1052
bambus2    No              573984   3  *      771         2757                 1607
bambus2    Yes             497907   3  *      811         2865                 1459
ILP        No              573984   3  1      671         3405                 1059
ILP        No              573984   3  2      671         3398                 1066
ILP        Yes             497907   3  1      708         3837                 590
ILP        Yes             497907   3  2      708         3798                 629
Results for 667 HuRef Chr21 Contigs
Algorithm  Pair filtering  # pairs  r  delta  Singletons  Correct orientation  Incorrect orientation
bambus2    No              231257   1  *      50          350                  267
bambus2    Yes             166571   1  *      54          333                  280
ILP        No              231257   1  1      44          377                  246
ILP        No              231257   1  2      44          377                  246
ILP        Yes             166571   1  1      52          390                  225
ILP        Yes             166571   1  2      52          435                  180
bambus2    No              231257   2  *      101         381                  185
bambus2    Yes             166571   2  *      139         371                  157
ILP        No              231257   2  1      78          585                  4
ILP        No              231257   2  2      78          587                  2
ILP        Yes             166571   2  1      99          567                  1
ILP        Yes             166571   2  2      99          567                  1
bambus2    No              231257   3  *      160         359                  148
bambus2    Yes             166571   3  *      231         319                  117
ILP        No              231257   3  1      107         556                  4
ILP        No              231257   3  2      107         558                  2
ILP        Yes             166571   3  1      145         522                  0
ILP        Yes             166571   3  2      145         522                  0
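One way to compare rows of the two results tables is the fraction of non-singleton contigs that were oriented correctly. This summary metric is an assumption introduced here for illustration and does not appear on the poster. The sketch uses four filtered r=3 rows copied from the tables (d=1 for the ILP) and also checks that singletons, correct and incorrect counts add up to the number of contigs in each dataset.

```python
# (dataset, algorithm, r, singletons, correct, incorrect), filtered pairs only
rows = [
    ("4x", "bambus2", 3, 811, 2865, 1459),
    ("4x", "ILP", 3, 708, 3837, 590),
    ("HuRef", "bambus2", 3, 231, 319, 117),
    ("HuRef", "ILP", 3, 145, 522, 0),
]
TOTAL_CONTIGS = {"4x": 5135, "HuRef": 667}


def oriented_accuracy(correct: int, incorrect: int) -> float:
    """Fraction of oriented (non-singleton) contigs with correct orientation."""
    return correct / (correct + incorrect)


for dataset, algorithm, r, singletons, correct, incorrect in rows:
    # Every contig is either a singleton or oriented (correctly or not).
    assert singletons + correct + incorrect == TOTAL_CONTIGS[dataset]
    acc = oriented_accuracy(correct, incorrect)
    print(f"{dataset:5s} {algorithm:8s} r={r} accuracy={acc:.3f}")
```

Because the number of singletons differs between tools, this fraction should be read alongside the singleton counts rather than on its own.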
Discussion and Future Work
HuRef Contigs
• In this ideal dataset, our filtering seems detrimental to BAMBUS at r=1,2. Altering r seems to have little effect on the number of correct edges.
• The ILP is only slightly affected by filtering.
• At r=1, our ILP performs comparably to BAMBUS. Higher redundancy greatly improves ILP accuracy.
• The d parameter has little effect.
4x Contigs
• In this more realistic dataset, filtering consistently helps both BAMBUS and the ILP at all redundancies.
• The redundancy threshold has little effect on BAMBUS, but a significant effect on the ILP.
• A higher redundancy is important, although further investigation into its relationship with read coverage is necessary.
• The d parameter is sometimes detrimental.
Future Work
Orientation is only part of the scaffolding problem; in ongoing work we are developing ordering and placement algorithms. We will also explore the effect of varying assembly and read coverage on the ability to accurately scaffold draft genomes.
Acknowledgments: This work has been supported in part by NSF awards IIS-0546457, IIS-0916401, IIS-0953563, and IIS-0916948.
Contig Graphs for Chr21 4x Contigs
[Figure panels: unfiltered, r=3, d=1; filtered, r=3, d=1]
• 4x contigs are more representative of typical genome projects
• Contigs are much shorter and create more complex structures
• The unfiltered graph has more edges inside a highly interconnected component
• Filtering separates some linear chains of contigs from the large interconnected component