Upload
casey
View
44
Download
0
Tags:
Embed Size (px)
DESCRIPTION
A hierarchical approach to building contig scaffolds. Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004. Sequencing pipeline. Random sequencing un-related reads 500-700 base-pairs Assembly un-related contigs 5,000-10,000 base-pairs Scaffolding - PowerPoint PPT Presentation
Citation preview
A hierarchical approach to building contig scaffolds
Mihai PopDan Kosack
Steven L. SalzbergGenome Research 14(1), pp. 149-159, 2004.
Sequencing pipeline
• Random sequencing• un-related reads 500-700 base-pairs
• Assembly• un-related contigs 5,000-10,000 base-pairs
• Scaffolding• un-related scaffolds 30,000-50,000 base-pairs
• Finishing/gap closure• completed genomes millions-billions of base-
pairs
Scaffolding
• Given a set of non-overlapping contigsorder and orient them along a chromosome
III III IV
I
IIIII
IV
Clone-mates
Clone
Insert
F R
FR
I II
R
I
F
II
F
II
R
I
Scaffolder output
Sequencing gaps
Physical gaps
Problems with the data
• Incorrect sizing of inserts• cut from gel – sizing is subjective• error increases with size
• Chimeras (ends belong to different inserts)• biological reasons (esp. for large sized inserts)• sample tracking (human error)
• Software must handle a certain error rate.
Theoretical abstraction
• Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs) provide a linear/circular embedding that preserves most constraints.
Graph representation• Nodes: contigs• Directed edges: constraints on relative
placement of contigs – relative order and relative orientation
• Embedding: order (coordinate along chromosome) and orientation (strand sampled)
Challenges
• Orientation – node coloring problem (forward/reverse)• feasibility – no cycles with odd number of
“reversal” edges (blue edges)• optimality – remove minimum number of edges
such that a solution exists (NP-hard)
Challenges
• Ordering – generate a linear embedding• feasibility – lengths of parallel DAG paths are
consistent• optimality – remove minimum number of edges
such that DAG is feasible (NP-hard)
The real world
• Use of scaffolds• Analysis – longest unambiguous sub-graphs• Finishing – present all “reliable” relationships
between contigs• Sources of error
• mis-assemblies• sizing errors (increases with library size)• chimeras
Ambiguous scaffold
I
II
III
I II III
I IIIII
I III I’ II
Hierarchical scaffolding
1. For each contig pair, consolidate all linking data into a single relationship – 2 correct links required
Hierarchical scaffolding
2. Use most reliable links to build scaffolds
3. Repeatedly build super-scaffolds based on less reliable linking data
Rationale
Hierarchical step
Problemcomplexity
problem size (#nodes)error rate
Linking information
• Overlaps
• Mate-pair links
• Similarity links
• Physical markers
• Gene synteny
reference genome
physical map
BAMBOO (BAMBUS)Best effort Attempt
Multiple Branches allowedOrder, Orient
Inputs
• Set of contigs: names and lengths• Groups of contig links:
• groups correspond to “quality” of links• link: relative distance between contig origins
relative orientation of contigs
• Priorities for each group – specify order in which links are considered
BE EB
min <= dist <= max
Outputs• XML representation of layout:
• contig orientations• contig position (x-coordinate of contig origin)• links used to construct layout
• Graphical display of the layout• uses GraphViz package from AT&T
1.0 release
• XML input not yet supported
• All scaffold placed in the same output file
• Only Linux executable released
• Hacker friendly
Current release: 2.33
• XML input and more general output module• Collection of input modules from common
assembly formats• Better handling of priority data• Repeat masking features• More platforms supported and source code
released as open source• http://amos.sourceforge.net
Future enhancements
• Option to generate un-ambiguous (non-branching) scaffolds
• Better layout algorithms• Specialized drawing tools• Interactive browser
• Represent/handle multiple haplotypes
Acknowledgements
• Dan Kosack• Steven Salzberg• Martin Shumway
• Hean Koo• Luke Tallon• Jessica Vamathevan