http://pinegenome.org/pinerefseq/
United States Department of Agriculture, National Institute of Food and Agriculture
The Sequencing, Assembly, and Characterization of a 22 Gb conifer genome, Loblolly pine
David Neale, Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Jill Wegrzyn, Jim Yorke, and Aleksey Zimin
Univ. of California, Davis; Children’s Hospital of Oakland Research Institute; Indiana Univ.; Washington State Univ.; Univ. of Maryland; and Johns Hopkins Univ.
A truly large genome
see poster 271, Daniela Puiu
Why Sequence a Conifer Genome?
• Phylogenetic Representation – None currently exists. The conifers (gymnosperms) are the oldest of the major plant clades, arising some 300 million years ago. They are key to our understanding of the origins of genetic diversity in higher plants.
• Ecological Representation – Conifers are of immense ecological importance, comprising the dominant life forms in most of the temperate and boreal ecosystems in the Northern Hemisphere.
• Fundamental Genetic Information – Reference sequences provide the data necessary to understand conifer biology and aid in guiding management of genetic resources.
Source: Jiao et al., Ancestral polyploidy in seed plants and angiosperms, Nature, Vol. 473, May 5, 2011
Elements of the Conifer Genome Sequencing Project
Plant Genome Size Comparisons
Image Credit: Modified from Daniel Peterson, Mississippi State University
[Figure: 1C DNA content (Mb) for Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pseudotsuga menziesii, Taxodium distichum, Picea abies, Picea glauca, Pinus taeda, and Pinus lambertiana]
Acquiring the DNA
• Haploid: megagametophyte tissue (1N), shotgun sequenced
• Diploid: needle tissue (2N), 40 Kb cloned fosmids, pooled and sequenced
Figure Credit: Nicholas Wheeler, University of California, Davis
Selecting a Megagametophyte
• Goal: deep (>50X) representative short insert libraries from a single haploid (1N) segregant.
• Libraries from DNA preps of 22 megagametophytes were prepared, sized and analyzed.
Most of the tissue in a pine seed is the haploid megagametophyte.
Strategy for De Novo Sequencing of the Conifer Genomes
Two Complementary Approaches
• Max. output: 95 gigabases; max. paired-end reads: 640 million; max. read length: 2 x 150 bp
• Max. output: 600 gigabases; max. paired-end reads: 6 billion; max. read length: 2 x 100 bp
Sequencing Strategy
60X 40X clone
Figure Credit: Nicholas Wheeler, University of California, Davis
Sequencing Strategy
Today
Sequencing Strategy
End of summer 2013
Over 16 billion reads
• 65X coverage in paired ends from a single seed
  • 1/3 in GAIIx, 160-bp overlapping pairs
  • 2/3 in HiSeq, 100-bp pairs
• 1.7 billion reads from “jumping” libraries
  • from pine needles, diploid DNA
See Daniela Puiu, poster 271, Friday 2pm
How to get all these reads into a single assembly run?
16 billion paired reads
Super-reads
• Based on the observation that most of the sequence in genomes is locally unique – branches are relatively rare
• We can efficiently count k-mers in the data set of all reads with Jellyfish, e.g.:
  read: AGCTGACTGACTGGTAACAA
  10-mers: AGCTGACTGA, GCTGACTGAC, ...
• Use all k-mers with counts > threshold T (e.g. T=1)
• The idea is to make reads longer instead of breaking them into k-mers.
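The k-mer counting step can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not Jellyfish itself: the function names `count_kmers` and `solid_kmers` and the second read are made up for illustration, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, threshold=1):
    """Keep only the k-mers whose count is strictly above the threshold T."""
    return {kmer for kmer, n in counts.items() if n > threshold}

# The first read is the one on the slide; the second is a hypothetical
# overlapping read shifted by one base.
reads = ["AGCTGACTGACTGGTAACAA", "GCTGACTGACTGGTAACAAT"]
counts = count_kmers(reads, k=10)
solid = solid_kmers(counts, threshold=1)
```

With T=1, a k-mer seen only once (for example the first 10-mer of the first read) is dropped, while k-mers shared by both reads are kept.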
• Consider a read – can its ends be extended uniquely?
  read: ACTGACCAGATGACCATGACAGATACATGGT
  left end: GACTGACCAG (count 5) → extend
  right end: ATACATGGTA (count 10) and ATACATGGTC (count 2) → stop
• Typically Illumina sequencing projects generate data with high coverage (>50x). With 100 bp reads this implies that a new read starts on average at least every other base.
[Figure: read R is extended to super-read S (red); the other reads extend to S as well]
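The "every other base" figure follows from dividing read length by coverage. A minimal sketch of that arithmetic, assuming reads start uniformly along the genome; the function name is illustrative only:

```python
def read_start_spacing(coverage, read_length):
    """Average number of bases between consecutive read starts.

    coverage = n_reads * read_length / genome_size, so the number of
    read starts per base is coverage / read_length, and the spacing
    between starts is the inverse of that.
    """
    return read_length / coverage

spacing = read_start_spacing(coverage=50, read_length=100)
```

At 50x coverage with 100 bp reads the spacing is 2 bases; higher coverage makes it smaller, which is why the slide says "at least" every other base.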
Super-reads: Extending a read to become a super-read
• Consider a read
  read: CGACTGACCAGATGACCATGACAGATACATGGT
  left end: GACTGACCAG (count 5) → extend; CGACTGACCA (count 3) → extend
  right end: ATACATGGTA (count 10) and ATACATGGTC (count 2) → stop
Super-reads: We can keep extending on the left
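The extension rule from the example read can be sketched as follows. This is a simplified illustration, not the MaSuRCA implementation: the function names are hypothetical, the set `solid` holds only the four k-mers shown on the slides, and an end is extended only while exactly one k-mer in the set continues it.

```python
BASES = "ACGT"

def extend_right(seq, solid, k):
    """Append bases while exactly one solid k-mer continues the right end."""
    seen = set()
    while True:
        suffix = seq[-(k - 1):]
        options = [suffix + b for b in BASES if suffix + b in solid]
        # Stop on a dead end, a branch, or a k-mer already used (cycle guard).
        if len(options) != 1 or options[0] in seen:
            return seq
        seen.add(options[0])
        seq += options[0][-1]

def extend_left(seq, solid, k):
    """Prepend bases while exactly one solid k-mer continues the left end."""
    seen = set()
    while True:
        prefix = seq[:k - 1]
        options = [b + prefix for b in BASES if b + prefix in solid]
        if len(options) != 1 or options[0] in seen:
            return seq
        seen.add(options[0])
        seq = options[0][0] + seq

def super_read(read, solid, k):
    """Extend a read in both directions (assumes k >= 2)."""
    return extend_right(extend_left(read, solid, k), solid, k)

read = "ACTGACCAGATGACCATGACAGATACATGGT"
solid = {"GACTGACCAG", "CGACTGACCA", "ATACATGGTA", "ATACATGGTC"}
extended = super_read(read, solid, k=10)
```

On this input the left end grows by G and then C, while the right end stops immediately because both ATACATGGTA and ATACATGGTC are present.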
Super-Reads Compress the Data
16 billion paired reads
150 million super-reads
• 100-fold compression
• 50% of sequence is in super-reads > 500 bp
• Super-read total: 52 Gbp
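The compression figure can be checked with simple arithmetic. The sketch below uses the rounded counts from the slide, so the results are approximate; the mean super-read length is derived here and is not stated in the talk.

```python
paired_reads = 16_000_000_000   # "16 billion paired reads" (rounded)
super_reads = 150_000_000       # "150 million super-reads" (rounded)
super_read_bp = 52_000_000_000  # "Super-read total: 52 Gbp"

# Reduction in the number of sequences handed to the assembler.
compression = paired_reads / super_reads

# Implied mean super-read length in bp (a derived value).
mean_length = super_read_bp / super_reads
```

The ratio comes out slightly above 100, consistent with the "100-fold" figure, and the implied mean length is a few hundred bp.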
Collect jumping reads from same haplotype
Figure Credit: Nicholas Wheeler, University of California, Davis
1.7 billion jumping reads (4 Kbp)
93 million Di-Tag reads (36 Kbp)
Keep only pairs where both reads match haploid DNA
Filter: both reads had to be covered by 52-mers from megagametophyte data
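One way to read the filter is that every base of a read must lie inside at least one k-mer that also occurs in the haploid data. The sketch below implements that interpretation with a toy k of 4 instead of 52. The function names and sequences are illustrative, and strand (reverse complement) handling is left out.

```python
def is_covered(read, haploid_kmers, k):
    """True if every base of read lies in a k-mer present in haploid_kmers."""
    if len(read) < k:
        return False
    covered_upto = 0  # index of the first base not yet covered
    for i in range(len(read) - k + 1):
        if read[i:i + k] in haploid_kmers:
            if i > covered_upto:
                return False  # base at covered_upto can no longer be covered
            covered_upto = i + k
    return covered_upto == len(read)

def keep_pair(read1, read2, haploid_kmers, k):
    """Keep a jumping pair only if both of its reads are covered."""
    return (is_covered(read1, haploid_kmers, k)
            and is_covered(read2, haploid_kmers, k))

# Toy haploid sequence and its 4-mers (the talk used 52-mers).
haploid = "ACGTACGGTCA"
k = 4
haploid_kmers = {haploid[i:i + k] for i in range(len(haploid) - k + 1)}

kept = keep_pair("ACGTACGG", "GTACGGTC", haploid_kmers, k)
dropped = keep_pair("ACGTACGG", "ACGTTTTT", haploid_kmers, k)
```

A pair whose second read carries sequence absent from the haploid k-mer set, such as a variant from the other haplotype, is discarded as a whole.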
MaSuRCA assembler performance
• 64-core computer with 1 terabyte of RAM
• Time/memory to assemble:
  • QuORUM error correction: 10 days / 800 GB
  • Super-reads construction plus filtering: 11 days / 400 GB
  • Contig and scaffold construction (uses CABOG assembler): 60+ days / 450 GB
  • Gap filling with super-reads: 8 days / 300 GB
*Also assembled all data with SOAPdenovo, see poster 271, Daniela Puiu
Genome Assemblies for Recently Sequenced Species

Year  Common Name     Scientific Name          Assembly Size (Gbp)  Predicted Size (Gbp)  N50 Contig (Kbp)  N50 Scaffold (Kbp)
2013  Loblolly Pine   Pinus taeda              20.1                 22.0                  8.2               30.7
2011  Potato          Solanum tuberosum        0.7                  0.8                   31.4              1320.0
2011  Orangutan       Pongo abelii/pygmaeus    3.1                  3.1                   15.5              740.0
2011  Naked Mole Rat  Heterocephalus glaber    2.7                  -                     19.3              1590.0
2011  Atlantic Cod    Gadus morhua             0.8                  -                     2.8               690.0
2011  Coral Reef      Acropora digitifera      0.4                  0.4                   10.7              190.0
2012  Gorilla         Gorilla gorilla gorilla  2.9                  -                     11.9              914.0
2012  Oyster          Crassostrea gigas        0.6                  0.6                   19.4              400.0
2013  Radish          Raphanus sativus L.      0.4                  0.5                   25.0*             -
2012  Wheat           Triticum aestivum        5.5                  17.0                  0.6               0.6

* The source layout does not show whether this value is the contig or the scaffold N50.
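For readers comparing the N50 columns, the statistic can be computed as in the sketch below. It uses the common definition (the length L such that sequences of length at least L hold half or more of the total bases); the function name and the sample lengths are illustrative, not data from the assemblies listed.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least
    half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

example = n50([2, 3, 4, 5, 6])
```

Here the total is 20, the two longest sequences (6 and 5) already hold 11 bases, so the N50 is 5.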
Loblolly transcriptome from 30 unique RNA collections
Carol Loopstra (RNA) and Keithanne Mockaitis (sequencing)
Coding transcripts, clustered outputs by assembler

transcript class                      Trinity 2012.10.05  Trinity 2013.02.25  Velvet 1.2.08 / Oases 0.2.08
complete CDS                          58,707              115,353             395,370
complete CDS, UTR poor                8,023               10,033              39,833
complete CDS, UTR very short/absent   1,076               1,393               7,298
total complete protein (non-unique)   67,806              126,779             442,501
partial protein coding                196,252             404,722             2,041,836
total                                 264,058             531,501             2,484,337
Protein coding loci, estimated from transcript evidence alone:
• 87,602 unique complete
• 64,610 mapped to the WGS assembly
Preliminary results, Keithanne Mockaitis
Does pine have 64,000 genes?
We don’t know (yet)
Ongoing Efforts
• Transcriptome + WGS assembly merging
• Fosmid pool sequencing and assembly
• Genome Annotation
• Sugar pine genome: 35 Gigabases!
PD David Neale (r), co-PD Jill Wegrzyn (c), and (l to r) John Liechty, Ben Figueroa, and Patrick McGuire, UC Davis
Co-PD Chuck Langley (r) and (l to r) Marc Crepeau, Kristian Stevens, and Charis Cardeno, UC Davis
(l to r) Co-PD Pieter de Jong, Ann Holtz-Morris, Maxim Koriabine, Boudewijn ten Hallers, CHORI BAC/PAC
Co-PD Carol Loopstra and Jeff Puryear, TAMU
Co-PD Keithanne Mockaitis and Zach Smith, Indiana U
Co-PD Dorrie Main, WSU
The Johns Hopkins and Maryland Genome Assembly Group featuring co-PD Steven Salzberg and Daniela Puiu (Johns Hopkins U) and co-PD Jim Yorke and Aleksey Zimin (U of Maryland)