The Sequencing, Assembly, and Characterization of a 22 Gb conifer … · 2013. 5. 11.

http://pinegenome.org/pinerefseq/

United States Department of Agriculture, National Institute of Food and Agriculture

The Sequencing, Assembly, and Characterization of a 22 Gb conifer genome, Loblolly pine

David Neale, Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Jill Wegrzyn, Jim Yorke, and Aleksey Zimin

Univ. of California, Davis; Children’s Hospital of Oakland Research Institute; Indiana Univ.; Washington State Univ.; Univ. of Maryland; and Johns Hopkins Univ.

A truly large genome

see poster 271, Daniela Puiu



Why Sequence a Conifer Genome?

•  Phylogenetic Representation –  None currently exists. The conifers (gymnosperms) are the oldest of the major plant clades, arising some 300 million years ago. They are key to our understanding of the origins of genetic diversity in higher plants.

•  Ecological Representation –  Conifers are of immense ecological importance, comprising the dominant life forms in most of the temperate and boreal ecosystems in the Northern Hemisphere.

•  Fundamental Genetic Information –  Reference sequences provide the data necessary to understand conifer biology and aid in guiding management of genetic resources.

Source: Jiao et al., Ancestral polyploidy in seed plants and angiosperms, Nature, Vol. 473, May 5, 2011



Elements of the Conifer Genome Sequencing Project



Plant Genome Size Comparisons

Image Credit: Modified from Daniel Peterson, Mississippi State University

[Figure: 1C DNA content (Mb) for Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pseudotsuga menziesii, Taxodium distichum, Picea abies, Picea glauca, Pinus taeda, and Pinus lambertiana]

Acquiring the DNA

•  Haploid: megagametophyte tissue (1N), shotgun sequenced

•  Diploid: needle tissue (2N), cloned into 40 Kb fosmids, pooled and sequenced

Figure Credit: Nicholas Wheeler, University of California, Davis

Selecting a Megagametophyte

•  Goal: deep (>50X) representative short insert libraries from a single haploid (1N) segregant.

•  Libraries from DNA preps of 22 megagametophytes were prepared, sized and analyzed.

Most of the tissue in a pine seed is the haploid megagametophyte.




Strategy for De Novo Sequencing of the Conifer Genomes

Two Complementary Approaches

•  Max. output: 95 Gigabases; max. paired-end reads: 640 million; max. read length: 2 x 150 bp

•  Max. output: 600 Gigabases; max. paired-end reads: 6 billion; max. read length: 2 x 100 bp

Sequencing Strategy

60X 40X clone


Sequencing Strategy

Today

Sequencing Strategy

End of summer 2013

Over 16 billion reads

•  65X coverage in paired ends from a single seed
   –  1/3 in GAIIx, 160-bp overlapping pairs
   –  2/3 in HiSeq, 100-bp pairs

•  1.7 billion reads from “jumping” libraries
   –  from pine needles, diploid DNA

See Daniela Puiu, poster 271, Friday 2pm

How to get all these reads into a single assembly run?

16 billion paired reads

Super-reads

•  Based on the observation that most of the sequence in genomes is locally unique – branches are relatively rare

•  We can efficiently count k-mers in the data set of all reads with Jellyfish, e.g.:

   AGCTGACTGACTGGTAACAA
   AGCTGACTGA
    GCTGACTGAC

•  Use all k-mers with counts > threshold T (e.g. T=1)

•  The idea is to make reads longer instead of breaking them into k-mers.

•  Consider a read – can its ends be extended uniquely?

   ACTGACCAGATGACCATGACAGATACATGGT
   –  Left end: GACTGACCAG (count 5): extend
   –  Right end: ATACATGGTA (count 10) and ATACATGGTC (count 2): stop

•  Typically, Illumina sequencing projects generate data with high coverage (>50x). With 100 bp reads this implies that a new read starts on average at least every other base.

[Figure: read R is extended to super-read S (red); the other reads overlapping R extend to the same super-read S]
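The "every other base" figure follows from the definition of coverage: with N reads of length L over a genome of size G, coverage is C = N·L/G, so read starts occur at a rate of N/G = C/L per base and the mean spacing is L/C. A minimal sketch of that arithmetic (the function name is illustrative, not from the slides):

```python
def mean_read_start_spacing(read_length_bp: float, coverage: float) -> float:
    """Average number of bases between consecutive read start positions.

    Coverage C = N * L / G, so starts per base = N / G = C / L,
    and the mean spacing between starts is L / C.
    """
    if read_length_bp <= 0 or coverage <= 0:
        raise ValueError("read length and coverage must be positive")
    return read_length_bp / coverage


# 100 bp reads at 50x coverage: one new read start every 2 bases on average.
spacing = mean_read_start_spacing(100, 50)
print(spacing)  # 2.0
```

At coverage above 50x the spacing drops below 2 bases, which is why many reads overlap any given read almost completely and extend to the same super-read.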

Super reads: Extending a read to become a super-read

•  Consider a read, extended on the left: CGACTGACCAGATGACCATGACAGATACATGGT
   –  Left end: GACTGACCAG (count 5): extend; CGACTGACCA (count 3): extend
   –  Right end: ATACATGGTA (count 10) and ATACATGGTC (count 2): stop

Super reads: We can keep extending on the left
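The extension rule on these slides can be sketched in a few lines of Python. This is an illustration only, not the MaSuRCA implementation: it counts k-mers with a plain `Counter` instead of Jellyfish, ignores reverse complements, and the function names and the `max_steps` guard are assumptions made for the example. The k-mer counts in the usage part are the ones shown on the slide.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seq, counts, k, threshold=1, max_steps=10_000):
    """Append bases while exactly one continuation k-mer has count > threshold."""
    for _ in range(max_steps):  # guard against cycling through repeats
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if counts.get(suffix + b, 0) > threshold]
        if len(candidates) != 1:
            break  # dead end (0) or branch (2+): stop
        seq += candidates[0]
    return seq


def extend_left(seq, counts, k, threshold=1, max_steps=10_000):
    """Prepend bases while exactly one continuation k-mer has count > threshold."""
    for _ in range(max_steps):
        prefix = seq[:k - 1]
        candidates = [b for b in "ACGT" if counts.get(b + prefix, 0) > threshold]
        if len(candidates) != 1:
            break
        seq = candidates[0] + seq
    return seq


# k-mer counts taken from the slide (k = 10, T = 1).
slide_counts = Counter({
    "GACTGACCAG": 5,
    "CGACTGACCA": 3,
    "ATACATGGTA": 10,
    "ATACATGGTC": 2,
})
read = "ACTGACCAGATGACCATGACAGATACATGGT"
super_read = extend_left(extend_right(read, slide_counts, 10), slide_counts, 10)
print(super_read)  # CGACTGACCAGATGACCATGACAGATACATGGT
```

The right end does not move because two continuations (counts 10 and 2) both exceed the threshold, while the left end gains G and then C before running out of supported k-mers.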

Super-Reads Compress the Data

16 billion paired reads

150 million super-reads

•  100-fold compression
•  50% of sequence is in super-reads > 500 bp
•  Super-read total: 52 Gbp

Collect jumping reads from same haplotype


1.7 billion jumping reads (4 Kbp)

93 million Di-Tag reads (36 Kbp)

Keep only pairs where both reads match haploid DNA

Filter: both reads had to be covered by 52-mers from megagametophyte data
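One way to read this filter is that every base of each read must lie inside at least one 52-mer that also occurs in the megagametophyte (haploid) data. The sketch below implements that interpretation; the exact criterion used in the project is not stated on the slide, the function names are hypothetical, and reverse complements are ignored for brevity.

```python
def is_covered(read, haploid_kmers, k=52):
    """True if every base of `read` lies in some k-mer present in `haploid_kmers`."""
    if len(read) < k:
        return False  # too short to contain any k-mer
    covered_until = 0  # bases [0, covered_until) are covered so far
    for i in range(len(read) - k + 1):
        if read[i:i + k] in haploid_kmers:
            if i > covered_until:
                return False  # gap that no later k-mer can close
            covered_until = i + k
    return covered_until == len(read)


def keep_pair(read1, read2, haploid_kmers, k=52):
    """Keep a jumping pair only if both mates are covered by haploid k-mers."""
    return is_covered(read1, haploid_kmers, k) and is_covered(read2, haploid_kmers, k)


# Toy example with k = 4 instead of 52 so the sequences stay readable.
haploid = "ACGTACGGTCA"
toy_kmers = {haploid[i:i + 4] for i in range(len(haploid) - 3)}
print(keep_pair("CGTACGG", "ACGGTCA", toy_kmers, k=4))   # both mates covered
print(keep_pair("CGTACGG", "ACGTTTTT", toy_kmers, k=4))  # second mate is not
```

In practice the k-mer set for a 22 Gb genome would come from a disk-backed or compact k-mer index rather than a Python set.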

MaSuRCA assembler performance

•  64-core computer with 1 Terabyte of RAM
•  Time/memory to assemble:
   –  QuORUM error correction: 10 days / 800 GB
   –  Super-reads construction plus filtering: 11 days / 400 GB
   –  Contig and scaffold construction: 60+ days / 450 GB (uses CABOG assembler)
   –  Gap filling with super-reads: 8 days / 300 GB

*Also assembled all data with SOAPdenovo, see poster 271, Daniela Puiu

Genome Assemblies for Recently Sequenced Species

Year  Common Name  Scientific Name  Assembly Size (GB)  Predicted Size (GB)  N50 Contig (KB)  N50 Scaffold (KB)
2013  Loblolly Pine  Pinus taeda  20.1  22.0  8.2  30.7
2011  Potato  Solanum tuberosum  0.7  0.8  31.4  1320.0
2011  Orangutan  Pongo abelii/pygmaeus  3.1  3.1  15.5  740.0
2011  Naked Mole Rat  Heterocephalus glaber  2.7  19.3  1590.0
2011  Atlantic Cod  Gadus morhua  0.8  2.8  690.0
2011  Coral Reef  Acropora digitifera  0.4  0.4  10.7  190.0
2012  Gorilla  Gorilla gorilla gorilla  2.9  11.9  914.0
2012  Oyster  Crassostrea gigas  0.6  0.6  19.4  400.0
2013  Radish  Raphanus sativus L.  0.4  0.5  25.0
2012  Wheat  Triticum aestivum  5.5  17.0  0.6  0.6
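For readers unfamiliar with the N50 columns: N50 is the length L such that contigs (or scaffolds) of length at least L together contain at least half of the assembled bases. A minimal sketch of the usual calculation follows; the function name and the toy lengths are illustrative and are not data from the project.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


# Toy contig lengths: total is 32, and the two longest contigs (8 + 8) reach half.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))  # 8
```

The same definition underlies the statement that 50% of the super-read sequence is in super-reads longer than 500 bp.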

Loblolly transcriptome from 30 unique RNA collections

Carol Loopstra (RNA) and Keithanne Mockaitis (sequencing)

Coding transcripts, clustered outputs by assembler

transcript class | Trinity 2012.10.05 | Trinity 2013.02.25 | Velvet 1.2.08 Oases 0.2.08
complete CDS | 58,707 | 115,353 | 395,370
complete CDS, UTR poor | 8,023 | 10,033 | 39,833
complete CDS, UTR very short/absent | 1,076 | 1,393 | 7,298
total complete protein (non-unique) | 67,806 | 126,779 | 442,501
partial protein coding | 196,252 | 404,722 | 2,041,836
total | 264,058 | 531,501 | 2,484,337

Protein coding loci, estimated from transcript evidence alone: 87,602 unique complete; 64,610 mapped to the WGS assembly

preliminary results, Keithanne Mockaitis

Does pine have 64,000 genes?

We don’t know (yet)



Ongoing Efforts

•  Transcriptome + WGS assembly merging
•  Fosmid pool sequencing and assembly
•  Genome Annotation
•  Sugar pine genome: 35 Gigabases!



PD David Neale (r), co-PD Jill Wegrzyn (c), and (l to r) John Liechty, Ben Figueroa, and Patrick McGuire, UC Davis

Co-PD Chuck Langley (r) and (l to r) Marc Crepeau, Kristian Stevens, and Charis Cardeno, UC Davis

(l to r) Co-PD Pieter de Jong, Ann Holtz-Morris, Maxim Koriabine, Boudewijn ten Hallers, CHORI BAC/PAC

Co-PD Carol Loopstra and Jeff Puryear, TAMU

Co-PD Keithanne Mockaitis and Zach Smith, Indiana U

Co-PD Dorrie Main, WSU

The Johns Hopkins and Maryland Genome Assembly Group, featuring co-PD Steven Salzberg and Daniela Puiu (Johns Hopkins U) and co-PD Jim Yorke and Aleksey Zimin (U of Maryland)
