1
Towards a Reference Genome for
Switchgrass (Panicum virgatum)
Jeremy Schmutz, Jarrod Chapman, Jerry Jenkins, Jane Grimwood, Kerrie
Barry, Gerald A. Tuskan, Daniel S. Rokhsar & many others
DOE Joint Genome Institute
Mission: Serving as a genomic user facility in support of DOE mission science
• Funded by Biological and Environmental Research (BER)
• Walnut Creek, CA
• ~270 employees
• HiSeq (9), MiSeq (6), PacBio (2), 454 (1)
• Includes partner laboratories such as HudsonAlpha
funded for specific goals
Bioenergy
Carbon cycling Biogeochemistry
Plants Fungi Microbes Metagenomes
2
JGI & BRCs
• Development of next-generation
bioenergy crops
• Discovery and design of enzymes and
microbes with novel biomass-degrading
capabilities
• Development of transformational
microbe-mediated strategies for biofuel
production
3
JGI Plant Program
Flagship Plant Genomes
Flagship Comparative Genomics
Resequencing & Population Diversity
Transcriptomics & sequence-based functional assays for Flagship Plants
Community Organization
Plant Customization
QTLs and Genotype/Phenotype Links
4
JGI Plant Flagship Genomes
1. Provide complete genomic resources and
genomes of direct DOE mission importance
2. Support efforts for cellulosic biofuel
development and feedstock customization
3. Foster communities to develop research
programs around DOE plants
4. Build a solid foundation for diversity and
functional studies
5
JGI Plant Flagships
6
Cellulosic Feedstocks: Sorghum, Poplar, Miscanthus, Switchgrass
Oil Seeds: Soybean
Models to understand plant biology of biofuel traits: Chlamy, Physco, Brachy, Foxtail, Panicgrass
Introduction to switchgrass
Plus:
• High cellulosic yields, marginal land, low input
plant
• Existing agronomy knowledge and breeding
from planting as a forage crop
• Perennial crop which can be annually harvested
after establishment
• Widespread native species in North America,
resistant to American pests
• Presumably large adaptive variation across the
growing regions
Minus:
• Widespread native species in North America,
very difficult to contain large scale plantings of
improved varieties
• Obligate outcrossing polyploid species
7
Switchgrass is difficult
Obligate outcrossing tetraploid!
• Difficult to inbreed
• 4 copies of genes (or maybe 2,
or 1 or more)
• Variation within and between
subgenomes
• Genome is only a reference for
one plant AP13, not for all
switchgrass or even all
Panicum virgatum individuals
A1 A2 B1 B2
A C A A
A A C C
A T C G
C C
8
Genomic view of polymorphism
9
Switchgrass Reference Project
• Goal: Produce a reference genome of AP13 that can be used for everything from marker-assisted breeding and QTL identification to direct functional work on understanding cell wall biosynthesis
• Project has spanned several phases:
1. Resource development
2. Initial whole genome shotgun sequencing (v0.0)
3. Localization and assembly on chromosomes (v1.0)
4. Ongoing improvement through direct sequencing of localized regions (v2.0)
10
Project origins
• Started as a BRC project to produce a reference genome (initiated by JBEI in Nov. 2008)
• Cultivar selected by the group: Alamo AP13
• DNA was isolated by Jeff Bennetzen’s group @ UGA and Rod Wing’s group @ AU for BAC libraries and sequencing
• Sequencing began in early 2010, while resource development was still in progress
11
Switchgrass marker paper
November 2011, The Plant Genome
12
BAC libraries & BES
• Generated BAC libraries with 2 cut sites, 330k BES and some clone-based sequencing; average insert sizes 110 kb and 144 kb.
• Clones available from CUGI
www.genome.clemson.edu
With: Pam Ronald’s Group @ JBEI
PLoS ONE, April 2012, Sharma et al.
13
EST & transcripts
• 510,000 Sanger ESTs from 9 tissues with C. Tobias
• 169,079 Sanger ESTs from cell wall, 11.5 million 454
ESTs from VS16 and AP13 with BESC/Noble
Table 1. Switchgrass 454-cDNA libraries and 454-ESTs

NCBI SRA accession #  JGI lib code  Tissues                   Plant growth stage/conditions  # of good ESTs  Mean length (bp)
Summer VS16 454 data
SRX026147             CFBB          Whole shoot               Leaf development               259,106         201
SRX026148             CFBC          Whole root                Leaf development               205,466         222
SRX026149             CFBF          Whole shoot               Stem elongation                194,426         194
SRX026150             CFBG          Whole root                Stem elongation                174,053         190
SRX026151-2           CFCZ          Whole shoot               Reproductive stage             219,230         189
SRX026153-4           CFFA          Whole root                Reproductive stage             234,107         205
SRX026155-6           CFFB          Panicles including seeds  Reproductive stage             220,933         212
Sub-total                                                                                    1,507,321       202
Alamo AP13 454 data
SRX057824             CCXN          Whole shoot               Stem elongation                733,173         202
SRX057825             CCXO          Whole root                Stem elongation                667,612         206
SRX057830             CFXX          Whole shoot               Leaf development               1,236,020       419
SRX057831             CFXY          Whole root                Leaf development               1,214,630       375
SRX057828             CFXW          Whole shoot               Stem elongation                1,357,290       223
SRX057829             CGGO          Whole root                Stem elongation                1,040,192       404
SRX057827             CGFF          Whole shoot               Reproductive stage             547,278         320
SRX057826             CGFC          Whole root                Reproductive stage             998,691         388
SRX057834             CGTX          Panicles including seeds  Reproductive stage             1,096,949       384
SRX057833             CGFI          Whole shoot               Stem elongation 2 w/drought    362,346         213
SRX057832             CGGU          Whole root                Stem elongation 2 w/drought    918,585         337
Sub-total                                                                                    10,172,766      316
Total                                                                                        11,680,087
Development of an integrated
transcript sequence database and a
gene expression atlas for gene
discovery and analysis in switchgrass
(Panicum virgatum L.) – Zhang et al.
2013 Plant Journal.
http://switchgrassgenomics.noble.org
14
Foxtail millet
• Foxtail millet sequenced with 8x Sanger sequence
• Demonstrated its use as a comparative basis for reconstructing switchgrass chromosomes
Nature Biotech May 13, 2012
15
Onward to V0.0
16
Tetraploid Switchgrass
Began with 8.3x 454 linear sequencing and added 6.5x
454 XLR+ longer sequencing (14.5x total coverage)
78% longer read length
54% longer HQ length
2x the yield per run
17
V0.0 Release PAG 2012
• Linear 454 > 200 bp
• Sampled both the “A”
and “B” genomes in
the tetraploid
• Assembled using
Newbler V2.6
• Results:
• Contig N50 of 3.8 Kb
• 1.466 Gb of total
sequence assembled.
• 80 contigs > 50 Kb
• 1,663 contigs > 25Kb
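For readers less familiar with the contig N50 statistic quoted above, here is a minimal sketch of how it is commonly computed. The contig lengths are made up for illustration and are not taken from the V0.0 assembly.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together hold at least half of the assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths in bp (illustration only).
toy_contigs = [8000, 5000, 4000, 3000, 2000, 1000, 1000]
print(n50(toy_contigs))
```

With many short contigs and few long ones, as in a fragmented shotgun assembly, the N50 stays small even when total assembled sequence is large.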
18
Annotation?
• Annotation was based solely on Sanger ESTs and homology (foxtail millet, rice, Brachypodium, and sorghum)
• 65,878 total loci containing protein-coding transcripts
• 4,193 total alternatively spliced transcripts
For primary transcripts:
• Average number of exons: 4.1
• Median exon length: 160
• Median intron length: 126
• Complete genes: 47,302
• Incomplete gene with start codon: 5,862
• Incomplete gene with stop codon: 10,459
19
Is this a genome?
How do we organize these fractured 410k
pieces into a reference genome and put
them into the correct subgenome?
The Map! The Map! The Map!
20
Genetic map AP13 x VS16
Switchgrass mapping population planted out at Noble
21
Mapping strategy
VS16
250 offspring (F1) of VS16xAP13
sequenced in pools of 10-12 (depth <~1X)
Find short sequences that are:
(1) Polymorphic in one parent
(2) Found in only one subgenome
(3) Not found in the other parent
These are simple markers to track by
resequencing F1 offspring.
Directly observe recombination in the
polymorphic parent.
AP13
X
Select markers like: AAAAAAAATCTCGTATGCATGGAGTACTAAATGAAGTCTATTTGCAAAAC A 15 T 12
AAAAAAAATCTCTCCAGGGCAAAAATAAAAAAATGAAAAAGAAAAAAAAA A 13 C 14
AAAAAAAATCTTCGTGAGGAATTTTCTGTGCACTTTAAGTCTTCAATAAC A 12 G 14
113,325 AP13-derived markers and 236,622 VS16-derived markers
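A rough sketch of one screening step for candidate markers like the three shown above. It assumes the two numbers on each line are read counts supporting each allele, and it keeps candidates where both alleles are well covered and similar in depth. The thresholds `min_depth` and `max_ratio` are hypothetical, and the subgenome and other-parent checks listed on the slide would need additional data that is not modeled here.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    sequence: str  # short marker sequence
    allele_1: str
    depth_1: int   # assumed: read count supporting allele_1
    allele_2: str
    depth_2: int   # assumed: read count supporting allele_2


def parse_candidate(line: str) -> Candidate:
    """Parse 'SEQUENCE ALLELE COUNT ALLELE COUNT' as laid out on the slide."""
    sequence, a1, d1, a2, d2 = line.split()
    return Candidate(sequence, a1, int(d1), a2, int(d2))


def looks_balanced(c: Candidate, min_depth: int = 10, max_ratio: float = 1.5) -> bool:
    """True if both alleles are covered and their depths are similar.

    min_depth and max_ratio are illustrative, not project values.
    """
    low, high = sorted((c.depth_1, c.depth_2))
    return low >= min_depth and high <= max_ratio * low


lines = [
    "AAAAAAAATCTCGTATGCATGGAGTACTAAATGAAGTCTATTTGCAAAAC A 15 T 12",  # from the slide
    "AAAAAAAATCTCTCCAGGGCAAAAATAAAAAAATGAAAAAGAAAAAAAAA A 13 C 14",  # from the slide
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC A 40 T 3",   # made up, unbalanced
]
kept = [c for c in map(parse_candidate, lines) if looks_balanced(c)]
print(len(kept), "of", len(lines), "candidates kept")
```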
Mapping population: Malay Saha
Map development: Jarrod Chapman
22
Initial VS16 map examples
First round of the map based on 106 offspring and VS16 specific subgenome differences and covers ~87K markers
23
New map examples
1. Second mapping round uses all 250 offspring, AP13 subgenome differences
2. Added additional markers from WGS assembly
3. 130k typed subgenome-specific markers + 418k markers that are linked to these
24
Marker distributions
25
5cM bin widths
Organizing assembly with map
• Original Newbler Assembly:
• 14.5x read coverage
• 556,117 contigs (1,466.3 Mb)
• V0.0 Release at PAG Jan 2012:
• 410,030 contigs (1,358.1 Mb)
• 73,010 contigs (426 Mb) mapped
and annotated to 21,624 FTM
genes.
26
27
Assembly process
1. Bin contigs into Linkage Groups (189,942 binned)
2. Subgenome duplicates (35,683 removed)
3. Collapse subgenomes contigs (36,467 collapsed)
Starting with 117,792 contigs, scaffolding performed using ABySS:
4. Scaffold each linkage group using 5.5x, 2x250, 800bp MiSeq
5. Scaffold using 18x, 2x100, 4kb & 5kb LFPE
6. Scaffold using 6x, 2x100, 9kb LFPE
7. Eliminate redundant ends on adjacent scaffolded contigs
8. Position scaffolds on genetic map
9. Order scaffolds using P. hallii (panic grass) synteny
THIS SYSTEM IS NOT STATIC AND IS EASILY EXTENDED TO INCLUDE ADDITIONAL DATA, CLONE SEQUENCE, LONG READ DATA, …
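The contig counts on this slide are consistent with one another: removing the subgenome duplicates and the collapsed contigs from the binned set gives the 117,792 contigs that go into scaffolding. A small check, using only the numbers from the slide:

```python
# Contig counts as given on the assembly process slide.
binned = 189_942              # contigs binned into linkage groups
duplicates_removed = 35_683   # subgenome duplicates removed
collapsed = 36_467            # subgenome contigs collapsed

remaining = binned - duplicates_removed - collapsed
print(remaining)
```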
28
Panicum hallii
• Native southwest grass
• Closely related (~4 MYR) to
tetraploid switchgrass
• Drought tolerance model
• 660Mb, mostly inbred
• w/Tom Juenger at UT-Austin
P. hallii synteny
• 31mers used to identify shared content
• P. virgatum scaffolds binned on P. hallii
• Orientation of P. virgatum scaffolds relative to P. hallii determined using BLAT
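A toy sketch of the binning idea: index the k-mers that are unique to one reference scaffold, then assign each query scaffold to the reference scaffold it shares the most k-mers with. This is an illustration under simplifying assumptions, not the project's pipeline: it uses k=5 on made-up sequences (the slide uses 31-mers), ignores reverse complements, and all names are hypothetical. Orientation (done with BLAT on the slide) is not covered.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference, k):
    """Map each k-mer to its reference scaffold, dropping k-mers
    that occur in more than one reference scaffold."""
    index, ambiguous = {}, set()
    for name, seq in reference.items():
        for km in kmers(seq, k):
            if km in index and index[km] != name:
                ambiguous.add(km)
            index.setdefault(km, name)
    for km in ambiguous:
        del index[km]
    return index


def bin_scaffold(seq, index, k):
    """Reference scaffold sharing the most k-mers with seq, or None."""
    votes = Counter(index[km] for km in kmers(seq, k) if km in index)
    return votes.most_common(1)[0][0] if votes else None


# Made-up sequences standing in for P. hallii scaffolds.
reference = {
    "hallii_1": "ACGTACGGTCAGTTAGC",
    "hallii_2": "TTGACCATGGCAAGTCCA",
}
index = build_index(reference, k=5)
print(bin_scaffold("CGGTCAGTTA", index, k=5))
print(bin_scaffold("CATGGCAAGT", index, k=5))
```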
65 P. virgatum scaffolds ordered on
P. hallii super_61
Before After
29
30
CHR 01: 2,291 corrections, 459 bp average (100bp to 4KB), 1.05 Mb removed from 44.7 Mb
Adjacent subgenome duplicates
Original scaffold Corrected scaffold (1.4kb eliminated)
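The CHR 01 figures can be cross-checked with simple arithmetic: 2,291 corrections at an average of 459 bp each account for the roughly 1.05 Mb removed. The fraction of the chromosome is derived here and is not stated on the slide.

```python
corrections = 2_291      # number of corrections on CHR 01
mean_bp = 459            # average correction size in bp
chromosome_mb = 44.7     # CHR 01 size before correction, in Mb

removed_mb = corrections * mean_bp / 1e6
fraction = removed_mb / chromosome_mb
print(f"{removed_mb:.2f} Mb removed ({fraction:.1%} of CHR 01)")
```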
Map Integration
• 548,932 AP13 specific subgenome markers used to position scaffolds and syntenic blocks
• 56,088 map joins (with 10kb Ns) were made for 18 (2x9) linkage groups
• Some map positions contain sizable blocks of contigs (10-20) that align to the same map position; these cannot be ordered or oriented, and are placed within the context of other scaffolds
31
Map vs. chromosomes
32
All chromosomes
33
Clone alignments against scaffolds
34
158KB clone on syntenic block
107KB clone on syntenic block
Chromosome Assignments
• Asked switchgrass “power” users for recommendation on naming
• Assigned using shared 21mer content with S. italica
• Designation of "a" assigned to the P. virgatum LG that contained more shared 21mers with S. italica
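The naming rule in the last bullet can be written as a small function. This is only a sketch: the linkage group names and 21-mer counts below are hypothetical, and how ties would be handled is not stated on the slide, so the function simply refuses to decide.

```python
def label_homeologs(shared_kmers):
    """Given shared 21-mer counts with S. italica for a pair of
    homeologous linkage groups, label the higher-count group 'a'."""
    if len(shared_kmers) != 2:
        raise ValueError("expected exactly two linkage groups")
    (lg_x, n_x), (lg_y, n_y) = shared_kmers.items()
    if n_x == n_y:
        raise ValueError("tie: no basis for choosing 'a'")
    first, second = (lg_x, lg_y) if n_x > n_y else (lg_y, lg_x)
    return {first: "a", second: "b"}


# Hypothetical counts for one homeologous pair.
labels = label_homeologs({"LG_07": 120_500, "LG_08": 98_300})
print(labels)
```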
35
Panicum virgatum V1.0 release
36
• Read coverage: 14.5x
• Release size: 1.22 Gb
• Contig L50: 5.7 KB
• 636.1 Mb of sequence in chromosomes
• 593.4 Mb off chromosomes, includes duplicate sequences
Annotation – resources
• Latest JGI pipeline for integrating RNA-seq data and available EST data
• Included: original Sanger ESTs, 454 ESTs, minimal FLcDNAs, 370 million pairs from GLBRC cultivars, 710 million pairs of RNA-seq data from germinating seed, stem-node, stem-internode, blade, immature flower
• Homology: rice, Brachypodium, foxtail millet, sorghum, maize, Arabidopsis, soybean, poplar and Swiss-Prot
37
Annotation
38
Annotation results
• Primary transcripts (loci): 98,007
• Alternative transcripts: 27,432
• Total transcripts: 125,439
• For primary transcripts:
– Average number of exons 3.9
– Median exon length 183
– Median intron length 133
39
Length  EST support  Peptide homology
100%    57,584       7,311
95%     62,327       33,251
90%     63,848       41,653
75%     66,121       58,319
50%     68,391       77,960
PLEASE
DO NOT
GET ATTACHED
TO GENE NAMES!
Phytozome advertisement
http://www.phytozome.net
40
Expression data for plants
Diversity data for plants
Paralogous gene analysis
41
• Genes “A”, “B”, and remaining
contigs aligned using BLASTP
• Alignments Screened:
• >80% identity and >80%
coverage
• Length of query and target
amino acid sequences had to be
within 20% of one another
• There are a total of:
• 29,357 “A” genes
• 27,522 “B” genes
• 41,128 genes in remaining
contigs
• 98,007 total genes
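A sketch of the alignment screen described above, applied to already-parsed BLASTP hits. Two details are assumptions, since the slide does not define them: coverage is taken as the aligned fraction of the query, and "within 20%" is measured relative to the longer of the two proteins. The hits themselves are made up.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    target: str
    pct_identity: float  # percent identity over the alignment
    align_len: int       # aligned residues on the query
    query_len: int       # query protein length (aa)
    target_len: int      # target protein length (aa)


def passes_screen(hit: Hit) -> bool:
    # Assumption: coverage = aligned fraction of the query.
    coverage = hit.align_len / hit.query_len
    # Assumption: length difference relative to the longer protein.
    length_gap = abs(hit.query_len - hit.target_len) / max(hit.query_len, hit.target_len)
    return hit.pct_identity > 80.0 and coverage > 0.80 and length_gap <= 0.20


hits = [
    Hit("geneA_1", "geneB_1", 92.0, 450, 500, 480),  # passes all three checks
    Hit("geneA_2", "geneB_2", 75.0, 450, 500, 480),  # identity too low
    Hit("geneA_3", "geneB_3", 95.0, 450, 500, 800),  # lengths too different
]
kept = [h for h in hits if passes_screen(h)]
print([h.query for h in kept])

# The gene counts on the slide add up to the annotation total.
assert 29_357 + 27_522 + 41_128 == 98_007
```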
SNP rates in chromosomes
42
                              AP13          VS16
Heterozygous SNPs             1,449,600     581,106
Homozygous SNPs               10,406        1,482,882
Heterozygous INDELs           864           924
Homozygous INDELs             1,920         9,391
Assembly Length               1,103.8 Mb    1,103.8 Mb
Callable Bases                466.0 Mb      241.6 Mb
Heterozygous Rate (Callable)  3.111 per Kb  2.405 per Kb
Homozygous Rate (Callable)    0.022 per Kb  6.137 per Kb
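The per-Kb rates in the table can be approximately reproduced by dividing SNP counts by callable bases. This sketch assumes the rates count SNPs only (not INDELs); because the callable-base figures are rounded to 0.1 Mb, the last digit can differ slightly from the slide.

```python
def rate_per_kb(variant_count, callable_mb):
    """Variants per kilobase of callable sequence."""
    return variant_count / (callable_mb * 1_000)


# Counts and callable bases copied from the table.
samples = {
    "AP13": {"het_snps": 1_449_600, "hom_snps": 10_406, "callable_mb": 466.0},
    "VS16": {"het_snps": 581_106, "hom_snps": 1_482_882, "callable_mb": 241.6},
}
for name, s in samples.items():
    het = rate_per_kb(s["het_snps"], s["callable_mb"])
    hom = rate_per_kb(s["hom_snps"], s["callable_mb"])
    print(f"{name}: het {het:.3f} per Kb, hom {hom:.3f} per Kb")
```

The contrast is the point of the slide: AP13 reads mapped to the AP13 reference show almost no homozygous differences, while VS16 shows many.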
Improving the genome
43
Towards switchgrass 2.0
1. Build new AP13 and VS16 maps from recent
sequence data (116 genotypes + 250 originals) to
help with subgenome localization
2. Upgrade mate pair data for AP13 to new, longer,
better, stronger mate pairs
3. Continue directed clone based sequencing of
switchgrass important regions
4. Version 2.0 of the genome with ~300-400 Mb of locally sequenced contigs, integrated into chromosomes
44
PLEASE
DO NOT
GET ATTACHED
TO ORDER!
Improvement project
45
Short Scaffolds: Selected properly projecting clones.
Long Scaffolds: Tiling path covering cell wall genes.
[Figure legend: Selected, Not Selected, Cell Wall EST, Redundant Clone]

             Long Scaffs  Short Scaffs  Total
Chromosomes  335          3,237         3,572
Remaining    6            1,182         1,188
Total        341          4,419         4,760
Improvement project
46
• 96-well clone-based pools with individual indexes
• Sequenced as 2x250 on HiSeq 2500, assembled, with minimal manual finishing
• Add 96 paired, sized libraries run as ¼ MiSeq run as needed
Switchgrass V2.0
47
Inputs: New Genetic Map, WGS Contigs, Clone Contigs, New HiSeq Frags, New LMP Pairs
1. Bin contigs and clones into linkage groups
2. Remove subgenome duplicates from clones and contigs
3. Scaffold each linkage group using 2x250, 800bp pairs
4. Scaffold using LMP pairs
5. Eliminate redundant ends on adjacent scaffolded contigs
6. Position scaffolds on genetic map
7. Order scaffolds using P. hallii synteny
Current JGI switchgrass projects
1. Community diversity project for up to 50 genotypes (12
sampled to date) – Laura Bartley
2. eQTL study of 90 genotypes (2 samples per) for biomass/cell
wall trait variants – with Laura Bartley and Malay Saha
3. BRC switchgrass projects, QTL mapping, engineered mutants
48
Purpose / Class | Targets | Genotypes | Contributors
Support Genetics / Mapping Parents | Biparental mapping parents, NAM parents | 22 | Saha, Brummer, Tobias, Wu, Bonos
Diversity | Each phylogenetic group from Lui et al. 2012, including octaploids; Mexican and NE accessions | 9 | Casler, Auer, Juenger
Genome Structure Determination / Genomic variants | dihaploid, selfed, intermediate genome size | 10 | Wu, Tobias, Brummer
Baseline Data / Interesting phenotypes | Transformed genotype, Other | 1 | Wang
Current total | | 42 |
Panicum hallii projects
1. Produce draft assembly V1.0 for Panicum hallii
2. Diploid panicum eQTLs for segregating
drought/biomass traits in HAL x FIL cross
3. Diversity sampling
49
Hall’s Panicgrass diversity west Texas to east Texas, Juenger Lab FIL2 HAL HAL X FIL2
Please apply these resources!
50
Acknowledgments
DOE JGI
Jane Grimwood (HA)
Jerry Jenkins (HA)
Jarrod Chapman
Shengqiang Shu
Dan Rokhsar
Kerrie Barry
BESC
Gerald Tuskan, ORNL
Katrien Devos, UGA
Yi-Ching Lee, Noble
Malay Saha, Noble
Michael Udvardi, Noble
Jiyi Zhang, Noble
JBEI
Pam Ronald, UC-Davis
Manoj Sharma, UC-Davis
Rita Sharma, UC-Davis
Others
Laura Bartley, OU
Christian Tobias, SGEC
Chris Saski, CUGI (BAC libs)
Tom Juenger, UT-Austin
Funding Sources
DOE DE-AC02-05CH11231
ARRA UC Berkeley
51
Questions for discussion
• How often should we update the switchgrass genome and annotation?
• What else can the JGI do that would be immediately useful for the switchgrass community?
• JGI CSP2015 deadline for LOI will be March 2014
• Comprehensive community proposals are greatly preferred!
52