The Changing Face of Sequencing: Strategies for de novo sequencing of complex genomes

Quick Review:

BACs

Whole Genome Shotgun

First, some history:

• 2000: Arabidopsis (BAC)
• 2005: Rice (BAC & WGS)
• 2006: Poplar (WGS)
• 2007: Grapevine (WGS)
• 2008: Maize (BAC)
• 2008: Papaya (WGS)
• 2009: Sorghum (WGS)

BAC-based vs WGS

BAC-by-BAC

Pros
• Simpler, more accurate assembly
• Localized sequence
• Easily distributed
• Can be targeted to regions

Cons
• Requires physical map
• Labor intensive
• Expensive (more libraries)
• Slower

WGS

Pros
• Physical map not needed, but helps
• Logistically simple
• Low library costs
• Rapid

Cons
• Complex assembly
• Harder to localize sequence
• Requires centralized assembly
• Whole genome or nothing

What made WGS possible?

• Long, high-quality Sanger reads (700-800bp)
• Paired-end libraries
• Range of insert sizes: 3kb, 8-10kb, 40kb fosmids
• Assemblers tailored to these data types
• Still not guaranteed: the public maize project went BAC by BAC

NGS changes all the rules

• Quantity, not quality, is now the focus
  • New platforms generate huge quantities of data
  • Read length and PEs initially limited de novo applications
• Rapid cycle of improvements
  • No time for standard approaches to spread beyond genome centers before the next cycle begins
  • Third-party software is sometimes slow to catch up
• Cost model has changed
  • Library construction used to be a minor component of cost
  • Unit used to be 96 or 384 reads…

Choice is now more complex than BAC vs WGS

One size does not fit all

• Every project has individual needs
• A monolithic reference genome is rarely needed now
• How bad are the repeat structures? Is it important to get them right?
• How important is it to anchor all the sequence to a genome location?
• What other genome data can be leveraged?

BACs and NGS – the problem

Pre-NGS, to sequence a BAC:
• Make 1 sequencing library: ~$50-100
• Sequence two 384-well plates of clones: ~$750
• ~6x coverage

With NGS, to sequence a BAC with 454:
• Make 1 sequencing library: ~$300
• Sequence 1/8 plate of 454: ~$1,000
• ~600x coverage

Too expensive, and too much coverage…
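The arithmetic behind this slide can be sketched as cost per BAC versus cost per fold of coverage. The dollar and coverage figures are the ones quoted above; the $75 library cost is an assumed midpoint of the quoted $50-100 range, and `cost_per_fold` is a hypothetical helper name.

```python
def cost_per_fold(library_cost, sequencing_cost, coverage):
    """Total cost of sequencing one BAC divided by the fold coverage obtained."""
    return (library_cost + sequencing_cost) / coverage

# Figures quoted on the slide; $75 is an assumed midpoint of the $50-100 range.
sanger = cost_per_fold(library_cost=75, sequencing_cost=750, coverage=6)
ngs_454 = cost_per_fold(library_cost=300, sequencing_cost=1000, coverage=600)

print(f"Sanger: ${75 + 750} per BAC, ${sanger:.2f} per fold of coverage")
print(f"454:    ${300 + 1000} per BAC, ${ngs_454:.2f} per fold of coverage")
```

Per fold of coverage 454 is far cheaper, but a single BAC only needs a small fraction of the ~600x delivered, so the cost per BAC goes up rather than down.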

New BAC-based approaches

One library per BAC is cost-prohibitive

Map-based BAC pooling:
• Retains some of the assembly benefits of BACs
• Reduced library costs over BAC-by-BAC
• If contiguous, retains the genome localization benefits

BAC pooling strategy: Chr3 short arm

[Figure: FPC contigs → selected BACs → contigs from individual BAC pools → scaffolds from individual BAC pools → superscaffolds spanning pool boundaries]

• Select FPC contigs on the short arm
• Select overlapping BACs and bin them into 3 Mb pools
• Pyrosequencing of BAC pools (~20x 454 Titanium reads, ~400bp each) and assembly of raw sequences
• Contigs are organized into scaffolds using 454 paired-end sequences (454 FLX PEs, ~250bp each)
• Generate superscaffolds using BAMBUS and BAC end sequences (BAC ends are used for very long scaffolds)

From Rounsley et al. (2009)

Results: Chr3S of Oryza barthii

• 6 x 3Mb BAC pools
• 1 Titanium run, 0.5 FLX run
• ~$12k in reagents
• Contig N50: 14.3 kb
• Scaffold N50: 370.9 kb
• Scaffold N50 after BAC ends: 3,165.1 kb
• Nucleotide accuracy: 2.2 errors per 10kb
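For readers unfamiliar with the N50 statistic used in these results, a minimal sketch of how it is commonly computed follows. The function name `n50` and the example lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Made-up contig lengths (kb), for illustration only.
contig_lengths = [50, 40, 30, 20, 10]
print("N50:", n50(contig_lengths), "kb")
```

The same function applies to contigs and scaffolds alike; only the input lengths differ.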

2D pooling: An alternative to contiguous BAC pools

• Place ordered clones in plates
• 1 library from each row
• 1 library from each column
• Identify reads from each individual clone by sequence overlap
• Then assemble each clone

Assembly unit reduced to ~ a single BAC

Library cost drops with size of grid:
• 10x10: 100 clones, 20 libraries
• 50x50: 2500 clones, 100 libraries

3D grid lowers cost even further:
• 10x10x10: 1000 clones, 30 libraries
• 20x20x20: 8000 clones, 60 libraries

Repeats may misbehave, but one can choose to ignore them
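A minimal sketch of the grid arithmetic and of the row/column deconvolution idea is given below. It assumes each read can be treated as a label that appears in exactly the pools containing its clone; `grid_summary`, `deconvolute`, and the toy plate are hypothetical names and data for illustration.

```python
def grid_summary(*dims):
    """Clones held, and libraries needed, for a pooling grid of the given dimensions."""
    clones = 1
    for d in dims:
        clones *= d
    return clones, sum(dims)  # one library per row, per column (and per layer in 3D)

def deconvolute(row_pools, col_pools):
    """Assign to clone (r, c) the reads seen in both row pool r and column pool c."""
    return {(r, c): row_pools[r] & col_pools[c]
            for r in row_pools for c in col_pools}

for dims in [(10, 10), (50, 50), (10, 10, 10)]:
    clones, libraries = grid_summary(*dims)
    print("x".join(map(str, dims)), "->", clones, "clones,", libraries, "libraries")

# Toy 2x2 plate; the read labels are made up.
plate = {(0, 0): {"a", "b"}, (0, 1): {"c"}, (1, 0): {"d"}, (1, 1): {"e", "f"}}
row_pools = {r: set().union(*(reads for (i, _), reads in plate.items() if i == r))
             for r in (0, 1)}
col_pools = {c: set().union(*(reads for (_, j), reads in plate.items() if j == c))
             for c in (0, 1)}
print(deconvolute(row_pools, col_pools) == plate)
```

If two clones on a diagonal share a repeat, that repeat shows up in both rows and both columns and is wrongly assigned to the other two clones as well, which is one way repeats can misbehave in this scheme.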

The ideal….

• One library per BAC clone, barcoded
• Sequence all clones from the BAC library in one combined, barcoded pool

BUT: currently not cost-effective. Individual DNA preps for thousands of BAC clones are costly

Is WGS with NGS feasible yet?

With 454:
• 400bp reads, plus 4kb and 20kb insert PE protocols
• Success may be species- and goal-dependent:
  • Arabidopsis (Roche & Ecker): small, low repeat content; 21kb contig N50; 2.6Mb scaffold N50
  • Cassava (Roche & JGI): 800Mb, lots of repeats; 5.3kb contig N50; 180kb scaffold N50; missing half of the genome (the repetitive half)

WGS with Solexa/Illumina:
• Improved read lengths, PE protocols
• Improved third-party assemblers, e.g. SOAPdenovo, Velvet
• Cucumber genome (BGI): 300Mb genome; 50x coverage with 50bp PE; 5kb contig N50, 60kb scaffold N50; much better when mixed with 4x Sanger; missing half of the genome (repeats)
• Panda genome (BGI): 3Gb genome; 50x coverage with 75bp PE; 300kb contig N50 (?)

Big question: what is the misassembly rate?

Building contigs from overlapping clones

• Cut with restriction enzyme (R.E.)
• Overlapping BACs share common fragments
• 5 overlapping BAC clones form a small contig

Two ways to detect the shared fragments:
• Measure lengths: overlapping BACs will share fragments of the same size
• Make a sequencing library and sequence from each cut site: overlapping BACs will share sequence tags next to each cut site
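The fragment-length comparison can be sketched as a count of fragments that match within a sizing tolerance. This is a simple greedy matcher written for illustration; the function name, the tolerance value, and the fragment sizes are assumptions, and real fingerprinting software scores overlaps statistically.

```python
def shared_fragments(sizes_a, sizes_b, tolerance=0):
    """Count fragments of clone A that match a not-yet-used fragment of
    clone B to within `tolerance` bp (greedy, smallest fragments first)."""
    remaining = sorted(sizes_b)
    shared = 0
    for size in sorted(sizes_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance:
                shared += 1
                del remaining[i]  # each fragment of B may be matched only once
                break
    return shared

# Made-up restriction fragment sizes (bp) for two clones.
clone_a = [1200, 3400, 5100, 800]
clone_b = [3405, 5100, 950, 7000]
print(shared_fragments(clone_a, clone_b, tolerance=10))
```

Sequence tags remove the need for a tolerance: two tags either match or they do not, which is the advantage of sequencing from each cut site.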

A BAC-WGS hybrid? Whole Genome Profiling by Keygene

A: Solexa-based BAC map
• Construct BAC library; array into 2D pools
• Cut with restriction enzyme and make 1 library per pool
• Generate sequence from libraries
• Deconvolute pools to identify the Solexa reads from each BAC
• Build a map from overlaps
• Map has a short sequence tag every 1-2kb in the genome

B: WGS sequencing with Solexa
• Assemble short contigs (high stringency)
• Use the above map to locate each contig in the genome
• Map can identify misassemblies

C: Result: high-quality map-based genome at a fraction of the cost

Simulation of Tag-based Map building

• Rice: 372Mb, 12 chromosomes
• Simulate a 10x BAC library: 28,600 clones
• Cut the sequence for each clone with HindIII
• Simulate a short read sequence from each site: 2.2 million sequence tags
• Build a map from these (overlapping clones share tags): 33 contigs built (<3 contigs per chromosome)
• Only 1 misassembly!
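A toy version of this kind of simulation is sketched below; it is not the code behind the slide. It uses a random 300 kb sequence instead of rice, 30 kb clones, 20bp tags, and a simple "share at least 2 tags" rule for joining clones, all of which are assumptions chosen to keep the example small.

```python
import random

SITE = "AAGCTT"  # HindIII recognition sequence

def site_tags(seq, tag_len=20):
    """Tags flanking each HindIII site: the site plus tag_len bases on one side.

    Tags truncated by the clone end are skipped. Both tags are kept on the
    forward strand, which is enough to use them as identifiers."""
    tags = set()
    pos = seq.find(SITE)
    while pos != -1:
        end = pos + len(SITE)
        if pos - tag_len >= 0:
            tags.add(seq[pos - tag_len:end])   # tag to the left of the site
        if end + tag_len <= len(seq):
            tags.add(seq[pos:end + tag_len])   # tag to the right of the site
        pos = seq.find(SITE, pos + 1)
    return tags

def build_contigs(clone_tags, min_shared=2):
    """Join clones sharing >= min_shared tags (union-find); return the contig count."""
    parent = list(range(len(clone_tags)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(clone_tags)):
        for j in range(i + 1, len(clone_tags)):
            if len(clone_tags[i] & clone_tags[j]) >= min_shared:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(clone_tags))})

rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(300_000))  # toy genome
clone_len, n_clones = 30_000, 100                              # ~10x clone coverage
starts = [rng.randrange(len(genome) - clone_len + 1) for _ in range(n_clones)]
tags = [site_tags(genome[s:s + clone_len]) for s in starts]
print("tags:", sum(len(t) for t in tags), "contigs:", build_contigs(tags))
```

For scale, the slide's own figures imply clones of about 130 kb (372Mb x 10 / 28,600) and roughly 77 tags per clone, consistent with a tag every 1-2kb. A random sequence has no repeats, so this toy cannot reproduce the misassembly that a real genome would produce.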

So you want to sequence a genome?

Lots of choices to make:
• BACs, WGS
• Which NGS technology?
• Single end, paired end?
• What size paired ends?
• What depth of coverage from each?

How do you pick?
• Do lots of testing of strategies: $$$$$
• Guess: free
• Copy what someone else did: free
• Educated guess based on simulation

How to decide on a strategy? Simulating Genome Sequencing

“Plantagora”: Plant Genome Assembly Simulation Platform

• Use existing genomes to simulate sequencing reads
• Combine reads in many combinations
• Assemble
• Score the results with meaningful metrics
• Report results on web site
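The first step, simulating reads from an existing genome, can be sketched as follows. This is not Plantagora's implementation: the reads are error-free, the insert size is fixed, and the function names and parameter values are assumptions made for illustration.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def simulate_pairs(genome, coverage, read_len, insert_size, rng):
    """Sample error-free paired-end reads to the requested fold coverage.

    Each pair is (forward read from the fragment start,
    reverse-complemented read from the fragment end)."""
    n_pairs = int(coverage * len(genome) / (2 * read_len))
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert_size + 1)
        fragment = genome[start:start + insert_size]
        pairs.append((fragment[:read_len],
                      reverse_complement(fragment[-read_len:])))
    return pairs

rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(10_000))  # stand-in genome
pairs = simulate_pairs(reference, coverage=10, read_len=50, insert_size=500, rng=rng)
print(len(pairs), "read pairs simulated")
```

Calling `simulate_pairs` with different read lengths, insert sizes, and coverages gives the read sets that would then be combined, assembled, and scored.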

Summary

• No longer BACs vs WGS
• Different ways of using BACs:
  • Linear pooling
  • 2D pooling
  • BACs for map, WGS for sequence
• WGS works on easy parts of genome
• Simulation is valuable in evaluating strategies
