http://pinegenome.org/pinerefseq/

United States Department of Agriculture, National Institute of Food and Agriculture

The Sequencing, Assembly, and Characterization of a 22 Gb Conifer Genome, Loblolly Pine

David Neale, Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Jill Wegrzyn, Jim Yorke, and Aleksey Zimin

Univ. of California, Davis; Children's Hospital of Oakland Research Institute; Indiana Univ.; Washington State Univ.; Univ. of Maryland; and Johns Hopkins Univ.

2013. 5. 11.

A truly large genome

See poster 271, Daniela Puiu

Why Sequence a Conifer Genome?

•  Phylogenetic Representation – None currently exists. The conifers (gymnosperms) are the oldest of the major plant clades, arising some 300 million years ago. They are key to our understanding of the origins of genetic diversity in higher plants.

•  Ecological Representation – Conifers are of immense ecological importance, comprising the dominant life forms in most of the temperate and boreal ecosystems in the Northern Hemisphere.

•  Fundamental Genetic Information – Reference sequences provide the data necessary to understand conifer biology and aid in guiding management of genetic resources.

Source: Jiao et al., Ancestral polyploidy in seed plants and angiosperms, Nature, Vol. 473, May 5, 2011

Elements of the Conifer Genome Sequencing Project

Plant Genome Size Comparisons

[Figure: bar chart of 1C DNA content (Mb), on an axis running from 0 to 40,000 Mb with an inset from 0 to 3,000 Mb, for Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pseudotsuga menziesii, Taxodium distichum, Picea abies, Picea glauca, Pinus taeda, and Pinus lambertiana.]

Image Credit: Modified from Daniel Peterson, Mississippi State University

Acquiring the DNA

•  Haploid: megagametophyte tissue (1N), shotgun sequenced

•  Diploid: needle tissue (2N), cloned into 40 Kb fosmids, pooled and sequenced

Figure Credit: Nicholas Wheeler, University of California, Davis

Selecting a Megagametophyte

•  Goal: deep (>50X) representative short-insert libraries from a single haploid (1N) segregant.

•  Libraries from DNA preps of 22 megagametophytes were prepared, sized, and analyzed.

Most of the tissue in a pine seed is the haploid megagametophyte.

Strategy for De Novo Sequencing of the Conifer Genomes

Two Complementary Approaches

•  Max. output: 95 gigabases; max. paired-end reads: 640 million; max. read length: 2 x 150 bp

•  Max. output: 600 gigabases; max. paired-end reads: 6 billion; max. read length: 2 x 100 bp

Sequencing Strategy

[Figure: sequencing strategy, with labels "60X" and "40X clone".]

Figure Credit: Nicholas Wheeler, University of California, Davis


Sequencing Strategy

Today


Sequencing Strategy

End of summer 2013

Over 16 billion reads

•  65X coverage in paired ends from a single seed
    •  1/3 in GAIIx, 160-bp overlapping pairs
    •  2/3 in HiSeq, 100-bp pairs

•  1.7 billion reads from "jumping" libraries
    •  from pine needles, diploid DNA

See Daniela Puiu, poster 271, Friday 2pm

How to get all these reads into a single assembly run?

16 billion paired reads

Super-reads

•  Based on the observation that most of the sequence in genomes is locally unique – branches are relatively rare

•  We can efficiently count k-mers in the data set of all reads with Jellyfish, e.g. the first two 10-mers of a read:

         AGCTGACTGACTGGTAACAA
         AGCTGACTGA
          GCTGACTGAC

•  Use all k-mers with counts > threshold T (e.g. T=1)

•  The idea is to make reads longer instead of breaking them into k-mers.
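The counting step can be sketched in a few lines of Python. This is only an in-memory stand-in for illustration: it does not use Jellyfish, the function names are hypothetical, and the second read is invented so that some k-mers occur twice.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, threshold=1):
    """Keep only k-mers whose count is strictly greater than threshold T."""
    return {kmer for kmer, n in counts.items() if n > threshold}


# The first read is the one shown on the slide; the second is a made-up
# overlapping read (shifted by one base) used only for this example.
reads = ["AGCTGACTGACTGGTAACAA", "GCTGACTGACTGGTAACAAT"]
counts = count_kmers(reads, k=10)
solid = solid_kmers(counts, threshold=1)
```

With T=1, a k-mer seen only once (often a sequencing error) is dropped, while k-mers shared by both reads are kept.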

Super-reads: Extending a read to become a super-read

•  Consider a read – can its ends be extended uniquely?

         Read:       ACTGACCAGATGACCATGACAGATACATGGT
         Left end:   GACTGACCAG (count 5): extend
         Right end:  ATACATGGTA (count 10) and ATACATGGTC (count 2): stop

•  Typically, Illumina sequencing projects generate data with high coverage (>50x). With 100-bp reads this implies that a new read starts on average at least every other base.

[Figure: read R is extended to super-read S (red); the other reads extend to S as well.]
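The "every other base" figure follows from the definition of coverage. As a worked version (the symbols G for genome size, N for number of reads, L for read length, and C for coverage are introduced here for illustration and are not on the slide):

```latex
\[
  C = \frac{N L}{G}
  \quad\Longrightarrow\quad
  \frac{G}{N} = \frac{L}{C} = \frac{100\ \text{bp}}{50} = 2\ \text{bases between consecutive read starts}
\]
```

At coverage above 50x the mean spacing drops below 2 bases, so many reads overlap any given read and would be expected to extend to the same super-read.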

Super-reads: We can keep extending on the left

•  Consider a read

         Read:       CGACTGACCAGATGACCATGACAGATACATGGT
         Left end:   GACTGACCAG (count 5): extend; CGACTGACCA (count 3): extend
         Right end:  ATACATGGTA (count 10) and ATACATGGTC (count 2): stop
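The extension rule on these slides can be sketched roughly as follows. This is a simplified illustration, not the MaSuRCA implementation: the function names are hypothetical, reverse complements are ignored, and the set of k-mers passing the count threshold is assumed to be given (here, just the four 10-mers shown on the slides).

```python
def unique_extension(seq, solid, k, left=False):
    """Return the single base that extends seq on one side, or None.

    A base qualifies if the k-mer formed by that base plus the adjacent
    k-1 bases of seq is in the solid set. Zero candidates (a dead end)
    and two or more candidates (a branch) both return None.
    """
    if left:
        candidates = [b for b in "ACGT" if b + seq[:k - 1] in solid]
    else:
        candidates = [b for b in "ACGT" if seq[-(k - 1):] + b in solid]
    return candidates[0] if len(candidates) == 1 else None


def to_super_read(read, solid, k):
    """Extend a read in both directions for as long as the extension is unique."""
    seq = read
    for left in (False, True):
        seen = set()  # k-mers already used, to avoid looping forever on a cycle
        while True:
            base = unique_extension(seq, solid, k, left)
            if base is None:
                break
            kmer = base + seq[:k - 1] if left else seq[-(k - 1):] + base
            if kmer in seen:
                break
            seen.add(kmer)
            seq = base + seq if left else seq + base
    return seq


# Read and 10-mers taken from the slides.
READ = "ACTGACCAGATGACCATGACAGATACATGGT"
SOLID = {"GACTGACCAG", "CGACTGACCA", "ATACATGGTA", "ATACATGGTC"}
super_read = to_super_read(READ, SOLID, k=10)
```

On this input the right end stops immediately because two 10-mers compete, while the left end gains G and then C, giving the read shown on the second slide.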


Super-Reads Compress the Data

16 billion paired reads

150 million super-reads

•  100-fold compression
•  50% of sequence is in super-reads > 500 bp
•  Super-read total: 52 Gbp
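As a quick check of the arithmetic behind these figures (the variable names are illustrative, and the mean super-read length is derived here, not stated on the slide):

```python
# Figures as given on the slide.
paired_reads = 16e9        # 16 billion paired reads
super_reads = 150e6        # 150 million super-reads
super_read_bases = 52e9    # 52 Gbp of super-read sequence in total

# Ratio of read count before and after: roughly the "100-fold" on the slide.
compression = paired_reads / super_reads

# Average super-read length implied by the totals, in bp.
mean_super_read_length = super_read_bases / super_reads

print(f"compression: about {compression:.0f}-fold")
print(f"mean super-read length: about {mean_super_read_length:.0f} bp")
```

The implied mean length is well under 500 bp, which is consistent with half of the sequence sitting in super-reads longer than 500 bp only if many super-reads are short.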

Collect jumping reads from same haplotype

Figure Credit: Nicholas Wheeler, University of California, Davis

1.7 billion jumping reads (4 Kbp)

93 million Di-Tag reads (36 Kbp)

Keep only pairs where both reads match haploid DNA

Filter: both reads had to be covered by 52-mers from megagametophyte data
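One possible reading of this filter is sketched below, assuming "covered" means that every base of a read lies inside at least one k-mer that also occurs in the haploid data. The function names are hypothetical, reverse complements are ignored, and the toy example uses k=4 in place of 52 to stay readable.

```python
def covered_by_kmers(read, haploid_kmers, k):
    """True if every base of read lies inside some k-mer from haploid_kmers."""
    if len(read) < k:
        return False
    covered_until = 0  # bases [0, covered_until) are covered so far
    for i in range(len(read) - k + 1):
        if i > covered_until:
            return False  # base at covered_until can no longer be covered
        if read[i:i + k] in haploid_kmers:
            covered_until = i + k
    return covered_until == len(read)


def keep_pair(pair, haploid_kmers, k):
    """Keep a mate pair only if both of its reads are covered."""
    first, second = pair
    return (covered_by_kmers(first, haploid_kmers, k)
            and covered_by_kmers(second, haploid_kmers, k))


# Toy data: a made-up haploid k-mer set and two made-up pairs.
haploid_kmers = {"ACGT", "GTAC"}
pairs = [("ACGTAC", "ACGT"), ("ACGTAC", "ACGTTT")]
kept = [p for p in pairs if keep_pair(p, haploid_kmers, k=4)]
```

The second pair is dropped because the tail of its second read is not covered by any haploid k-mer, which is how sequence from the other haplotype of the diploid needle DNA would show up.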

MaSuRCA assembler performance

•  64-core computer with 1 terabyte of RAM
•  Time/memory to assemble:
    •  QuORUM error correction: 10 days / 800 GB
    •  Super-reads construction plus filtering: 11 days / 400 GB
    •  Contig and scaffold construction: 60+ days / 450 GB
        •  uses CABOG assembler
    •  Gap filling with super-reads: 8 days / 300 GB

*Also assembled all data with SOAPdenovo, see poster 271, Daniela Puiu

Genome Assemblies for Recently Sequenced Species

Year  Common Name     Scientific Name          Assembly Size (Gb)  Predicted Size (Gb)  N50 Contig (Kb)  N50 Scaffold (Kb)
2013  Loblolly Pine   Pinus taeda              20.1                22.0                 8.2              30.7
2011  Potato          Solanum tuberosum        0.7                 0.8                  31.4             1320.0
2011  Orangutan       Pongo abelii/pygmaeus    3.1                 3.1                  15.5             740.0
2011  Naked Mole Rat  Heterocephalus glaber    2.7                 -                    19.3             1590.0
2011  Atlantic Cod    Gadus morhua             0.8                 -                    2.8              690.0
2011  Coral Reef      Acropora digitifera      0.4                 0.4                  10.7             190.0
2012  Gorilla         Gorilla gorilla gorilla  2.9                 -                    11.9             914.0
2012  Oyster          Crassostrea gigas        0.6                 0.6                  19.4             400.0
2013  Radish          Raphanus sativus L.      0.4                 0.5                  25.0             -
2012  Wheat           Triticum aestivum        5.5                 17.0                 0.6              0.6

(A "-" marks a value left blank on the slide.)
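For readers unfamiliar with the N50 columns: N50 is the length such that contigs (or scaffolds) at least that long contain at least half of all assembled bases. A minimal sketch of the calculation, using made-up lengths that have nothing to do with the assemblies in the table:

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6]
example_n50 = n50(example_lengths)
```

Here the total is 20, and the two longest contigs (6 and 5) together reach 11, which is the first time at least half of the bases are covered, so the N50 is 5.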

Loblolly transcriptome from 30 unique RNA collections

Carol Loopstra (RNA) and Keithanne Mockaitis (sequencing)

Coding transcripts, clustered outputs by assembler

Transcript class                      Trinity 2012.10.05  Trinity 2013.02.25  Velvet 1.2.08 / Oases 0.2.08
complete CDS                          58,707              115,353             395,370
complete CDS, UTR poor                8,023               10,033              39,833
complete CDS, UTR very short/absent   1,076               1,393               7,298
total complete protein (non-unique)   67,806              126,779             442,501
partial protein coding                196,252             404,722             2,041,836
total                                 264,058             531,501             2,484,337

Protein coding loci, estimated from transcript evidence alone:
•  87,602 unique complete
•  64,610 mapped to the WGS assembly

Preliminary results, Keithanne Mockaitis

Does pine have 64,000 genes?

We don't know (yet)

Ongoing Efforts

•  Transcriptome + WGS assembly merging
•  Fosmid pool sequencing and assembly
•  Genome annotation
•  Sugar pine genome: 35 gigabases!

PD David Neale (r), co-PD Jill Wegrzyn (c), and (l to r) John Liechty, Ben Figueroa, and Patrick McGuire, UC Davis

Co-PD Chuck Langley (r) and (l to r) Marc Crepeau, Kristian Stevens, and Charis Cardeno, UC Davis

(l to r) Co-PD Pieter de Jong, Ann Holtz-Morris, Maxim Koriabine, Boudewijn ten Hallers, CHORI BAC/PAC

Co-PD Carol Loopstra and Jeff Puryear, TAMU

Co-PD Keithanne Mockaitis and Zach Smith, Indiana U

Co-PD Dorrie Main, WSU

The Johns Hopkins and Maryland Genome Assembly Group featuring co-PD Steven Salzberg and Daniela Puiu (Johns Hopkins U) and co-PD Jim Yorke and Aleksey Zimin (U of Maryland)