Omics-driven genome annotation
Annotation of eukaryotic genomes at the Genoscope
Marion Dubarry
31st March 2017
Genoscope
- French sequencing center, created in 1997 and part of the CEA / Institut de Génomique since 2007
- Provides high-throughput sequencing data to the academic community, and carries out in-house genomic projects
- Focus on biodiversity: de novo sequencing and metagenomics projects
- Coordination of large sequencing projects like Tara Oceans
Marion Dubarry | Omics-driven genome annotation | 31st March 2017 | Lille
Genoscope Sequencers
sequencing technology
ABI 3730 XL
2 HiSeq 2500
2 HiSeq 4000
2 MiSeq
6 MinION MK1
1 PromethION
1 Bionano (Irys System)
Oxford Nanopore Technology
sequencing technology
- USB plugin
- No DNA synthesis
- Error rates around 10%
- Low-cost device: $1,000
nanoporetech.com
Fast evolving technology
ONT correction using NaS
Quality improvement, on average:
- 33% of errors in 1D with R7.3; 14% with R9.4
- 18% of errors in 2D with R7.3; 9% with R9.4
Correction using NaS
Oxford Nanopore Technology
NaS is a hybrid approach that takes advantage of data produced by the MinION device
https://github.com/institut-de-genomique/NaS
ONT correction using NaS
NaS is based on micro-assemblies to produce near-perfect reads
github.com/institut-de-genomique/NaS
Lymnaea stagnalis
STAGING : Lymnaea STAGnalis INternational Genome initiative
- hermaphrodite invertebrate species
- a multi-disciplinary consortium
- need for a high quality reference genome
Copyright of L. stagnalis pictures: E. de Roij and C. Levesque
Genome sequencing
Lymnaea stagnalis
Sequencer | Fragment size | Read size (bp) | Coverage (X)
HiSeq 2000 | 3-15 Kbp | 101 | 135
HiSeq 2500 | 600 bp | 250 | 39
MiSeq | 400-700 bp | 300 | 12
Genome assembly
Lymnaea stagnalis
Assembly with SPAdes (SSPACE + GapCloser)
Metrics | Scaffold (bp)
Assembly size | 942,996,421
Number | 6,640
Max length | 7,106,496
N50 | 957,215
L50 | 297
%N | 7.33
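For readers unfamiliar with the N50 and L50 metrics reported above, here is a minimal sketch of how they are commonly computed from a list of scaffold lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions; they are not taken from the talk or from the assembler's own reporting code.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths in bp.

    N50 is the length of the scaffold at which the running total,
    taken from longest to shortest, first reaches half the assembly size.
    L50 is the number of scaffolds needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must not be empty")


# Toy example with made-up lengths (not the L. stagnalis scaffolds).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50_l50(toy_lengths)
```

Read this way, the table says that the 297 longest scaffolds carry at least half of the 943 Mbp assembly, and the shortest of those is 957,215 bp long.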
Annotation pipeline at the Genoscope
Masking
annotation pipeline
Annotated TEs were provided by URGI (INRA)
RNA mapping
annotation pipeline
- Read assembly using Oases
- Filter 1
  - Best match (blat score) / contig
  - Identity percent ≥ 90%
  - Introns ≥ 100 kb are split
- Filter 2
  - Identity percent ≥ 90%
  - Length ratio of aligned contig ≥ 85%
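The two filters above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Genoscope pipeline code: the `Alignment` record, its field names, and the use of genomic block lengths as a proxy for aligned contig length are all hypothetical, and identity is not recomputed after an alignment is split.

```python
from dataclasses import dataclass, replace

MAX_INTRON = 100_000      # introns at or above this length split the alignment
MIN_IDENTITY = 90.0       # percent identity threshold (filters 1 and 2)
MIN_LENGTH_RATIO = 0.85   # aligned fraction of the contig (filter 2)


@dataclass(frozen=True)
class Alignment:          # hypothetical record, for illustration only
    contig: str
    score: int            # e.g. a blat score
    identity: float       # percent identity
    contig_length: int
    exons: tuple          # ((start, end), ...) sorted genomic blocks, end exclusive


def best_per_contig(alignments):
    """Filter 1a: keep the highest-scoring alignment of each contig."""
    best = {}
    for aln in alignments:
        if aln.contig not in best or aln.score > best[aln.contig].score:
            best[aln.contig] = aln
    return list(best.values())


def split_long_introns(aln):
    """Filter 1c: cut the alignment wherever an intron is >= MAX_INTRON."""
    pieces, current = [], [aln.exons[0]]
    for prev, nxt in zip(aln.exons, aln.exons[1:]):
        if nxt[0] - prev[1] >= MAX_INTRON:
            pieces.append(tuple(current))
            current = []
        current.append(nxt)
    pieces.append(tuple(current))
    return [replace(aln, exons=piece) for piece in pieces]


def aligned_ratio(aln):
    """Fraction of the contig covered by the aligned blocks (approximation)."""
    return sum(end - start for start, end in aln.exons) / aln.contig_length


def filter_rna(alignments):
    kept = []
    for aln in best_per_contig(alignments):
        if aln.identity < MIN_IDENTITY:           # filter 1b
            continue
        for piece in split_long_introns(aln):
            # Filter 2: identity and length ratio, checked on each piece.
            if piece.identity >= MIN_IDENTITY and aligned_ratio(piece) >= MIN_LENGTH_RATIO:
                kept.append(piece)
    return kept
```

In this sketch a contig whose alignment is split by a very long intron usually fails the 85% length ratio on each piece, which is one plausible reading of why the split is applied before the second filter.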
Nanopore and RNA mapping
Lymnaea stagnalis
RNAseq | # of mapped contigs | % of mapped contigs | % identity | # exon/alignment | %GT/AG | %GC/AG
11 mat. Illumina TruSeq Stranded | 489,344 | 96.4 | 99.7 | 3.1 | 97.86 | 1.34
4 mat. Illumina Smarter | 128,780 | 96 | 99.5 | 4.24 | 96.44 | 1.68
Nanopore run 1 | 105,718 | 58.9 | 85.7 | 3.5 | 78.3 | 10
Nanopore run 2 | 65,859 | 47.4 | 85.4 | 3.7 | 80.1 | 9.9
NaS reads run 1 | 121,626 | 98.6 | 99.7 | 4.92 | 95.97 | 2.36
NaS reads run 2 | 75,950 | 93.3 | 99.6 | 4.98 | 95.95 | 2.40
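The %GT/AG and %GC/AG columns report which dinucleotides border the introns implied by the alignments. A minimal sketch of that tally follows, assuming forward-strand introns given as 0-based half-open coordinates on a genome string. The function name and the toy sequence are illustrative assumptions, not part of the pipeline.

```python
from collections import Counter


def splice_site_percentages(genome, introns):
    """Percentage of introns per donor/acceptor dinucleotide pair.

    genome  : sequence string
    introns : list of (start, end), 0-based, end exclusive, forward strand only
    """
    counts = Counter()
    for start, end in introns:
        donor = genome[start:start + 2].upper()      # first two intron bases
        acceptor = genome[end - 2:end].upper()       # last two intron bases
        counts[f"{donor}/{acceptor}"] += 1
    total = sum(counts.values())
    return {pair: 100 * n / total for pair, n in counts.items()} if total else {}


# Toy sequence with four introns: three GT/AG and one GC/AG.
toy_genome = "ATG" + "GTAAAG" + "CCC" + "GCTTAG" + "TTT" + "GTCCAG" + "AA" + "GTTTAG" + "C"
toy_introns = [(3, 9), (12, 18), (21, 27), (29, 35)]
toy_percentages = splice_site_percentages(toy_genome, toy_introns)
```

Canonical splice sites dominate in accurate alignments, so the drop to about 80% GT/AG for raw Nanopore reads, and the recovery to about 96% after NaS correction, is a convenient proxy for alignment quality at exon boundaries.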
Nanopore mapping
Lymnaea stagnalis
Protein mapping
annotation pipeline
- Gastropoda protein sequences (from UniProt)
- Protein sequences are masked (low complexity) using seg
- Filter 1
  - Best match (blat score) / contig
  - Matches with a score within 90% of the best match (BM) score
  - Introns ≥ 100 kb are split
- Filter 2
  - Length ratio of aligned protein ≥ 50%
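The score and length criteria above can be sketched as below. This is an illustration under assumptions: "within 90% of the BM score" is read here as "score at least 0.9 times the best score for that protein", and the dictionary keys (`protein`, `score`, `aligned_aa`, `protein_aa`) are hypothetical names, not fields of any real tool output. The intron-splitting step is left out for brevity.

```python
from collections import defaultdict

SCORE_FRACTION = 0.90     # assumed reading of "within 90% of the BM score"
MIN_LENGTH_RATIO = 0.50   # aligned fraction of the protein (filter 2)


def filter_protein_matches(matches):
    """Keep near-best matches that cover at least half of the protein.

    matches: iterable of dicts with hypothetical keys
             'protein', 'score', 'aligned_aa', 'protein_aa'.
    """
    by_protein = defaultdict(list)
    for match in matches:
        by_protein[match["protein"]].append(match)

    kept = []
    for group in by_protein.values():
        best_score = max(m["score"] for m in group)
        for m in group:
            near_best = m["score"] >= SCORE_FRACTION * best_score
            long_enough = m["aligned_aa"] / m["protein_aa"] >= MIN_LENGTH_RATIO
            if near_best and long_enough:
                kept.append(m)
    return kept
```

Keeping near-best matches, rather than only the single best one, lets a protein support several loci, which is useful for multi-copy gene families.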
Protein mapping
annotation pipeline
Protein mapping against UniProt
Mapping metrics
Number of sequences: 112,349
Number of matches: 77,070
Number of exons: 273,997
Number of exons/model: 3.56
Number of mapped sequences: 63,404 (56.4%)
Number of unmapped sequences: 48,945
Number of monoexonics: 41,739
Number of loci: 15,704
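As a quick consistency check, the derived figures in the mapping metrics can be recomputed from the raw counts. The sketch below assumes that "exons/model" is the number of exons divided by the number of matches; the slide does not state this, but the numbers agree with that reading.

```python
# Raw counts copied from the mapping metrics.
sequences = 112_349
mapped = 63_404
unmapped = 48_945
matches = 77_070
exons = 273_997

# Mapped and unmapped sequences should add up to the input set.
totals_agree = (mapped + unmapped == sequences)

# Share of input proteins with at least one retained match.
mapped_pct = round(100 * mapped / sequences, 1)

# Assumption: one model per match, so exons/model = exons / matches.
exons_per_model = round(exons / matches, 2)
```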
Gmove
annotation pipeline
github.com/institut-de-genomique/gmove
Gmove
Benchmark with Augustus and Maker
Annotation of D. rerio (Chr. 3) with Augustus, Maker2 and Gmove
Metric | ref | Gmove | Augustus | Maker
Gene count | 1293 | 1383 | 2059 | 1227
Gene without intron | 104 | 394 | 495 | 322
Gene length (avg : med) | 20618.63 : 9376 | 17300.56 : 6339 | 17327.97 : 8043 | 7984.99 : 4301
Exons / gene (avg : med) | 8.27 : 6 | 6.65 : 4 | 5.76 : 3 | 5.75 : 3
CDS length (avg : med) | 1469.01 : 1077 | 1189.75 : 873 | 1393.04 : 945 | 1117.13 : 798
Coding base coverage | 3.0% | 2.6% | 4.6% | 2.2%
Intron count | 9397 | 7815 | 9798 | 5830
Intron length (avg : med) | 2634.93 : 898 | 2851.09 : 1011 | 3348.65 : 1331 | 1445.43 : 680
(SN+SP)/2 | - | 30 | 17 | 22
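The last row of the benchmark, (SN+SP)/2, averages sensitivity and specificity against the reference annotation. The slide does not say at which level (nucleotide, exon or gene) the comparison was made, so the sketch below assumes exact matching of features represented as (sequence, start, end) tuples, and uses the gene-prediction convention where specificity is TP / (TP + FP). The function name is illustrative.

```python
def sn_sp_average(reference, predicted):
    """Return (SN, SP, (SN+SP)/2) in percent for exact feature matches.

    SN = TP / (TP + FN): share of reference features that were predicted.
    SP = TP / (TP + FP): share of predicted features found in the reference.
    """
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    sn = 100 * true_positives / len(ref) if ref else 0.0
    sp = 100 * true_positives / len(pred) if pred else 0.0
    return sn, sp, (sn + sp) / 2


# Toy example with made-up coordinates (not the D. rerio data).
toy_reference = [("chr3", 100, 200), ("chr3", 300, 400),
                 ("chr3", 500, 600), ("chr3", 700, 800)]
toy_predicted = [("chr3", 100, 200), ("chr3", 300, 400), ("chr3", 505, 600),
                 ("chr3", 900, 950), ("chr3", 1000, 1100)]
toy_scores = sn_sp_average(toy_reference, toy_predicted)
```

Averaging the two values penalizes both over-prediction (many false positives lower SP) and under-prediction (many missed features lower SN).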
Genome browser
Lymnaea stagnalis
Lymnaea stagnalis annotation
Gene prediction metrics | Lymnaea stagnalis | Lottia gigantea
Assembly size | 943 Mbp | 359 Mbp
# predicted genes | 22,499 | 23,340
# predicted genes without intron | 3,492 (15.5%) | 5,111
# exons per gene (avg : med) | 7.59 : 5 | 6.01 : 4
Gene size (nt, avg : med) | 15,299 : 8,502 | 5,325.63 : 2,970
CDS size (nt, avg : med) | 1,375 : 993 | 1,150.23 : 852
Intron size (nt, avg : med) | 1,611 : 604 | 778.12 : 367
Complete Single-Copy BUSCOs | 97% | -
Lottia gigantea annotation from the JGI
Domain detection
Lymnaea stagnalis
Domains were detected using InterProScan
Domains
# predicted genes with domains: 19,387 (86%)
# predicted genes with IPR: 16,622 (74%)
# predicted genes with Pfam domains: 14,872 (66%)
Conclusion
STAGING project
- The genome assembly is around 943 Mbp (N50 957 Kbp)
- 15 Illumina materials, 2 Nanopore materials and Gastropoda protein sequences from UniProt were used for the annotation
- Nanopore reads were corrected using NaS
- 22,499 genes were predicted using Gmove
- 86% of predicted genes contain at least one domain
Perspectives
- Waiting for ONT direct RNA sequencing
- Gmove is a new combiner tool: it will take nanopore reads
Conclusion
Genome assembly and/or annotation at the Genoscope
R&D BioSeq team
www.genoscope.cns.fr/rdbioseq
- Genoscope
- Marie-Agnès Coutellec (INRA, Agrocampus-Ouest, UMR0985 Ecology and Ecosystem Health, Rennes)
- Funding agencies: CEA, Genoscope and France Génomique