Omics-driven genome annotation Annotation of eukaryotic genomes at the Genoscope Marion Dubarry 31st March 2017


Omics-driven genome annotation

Annotation of eukaryotic genomes at the Genoscope

Marion Dubarry

31st March 2017


Genoscope

• French sequencing center, created in 1997 and part of the CEA / Institut de Génomique since 2007
• Provides high-throughput sequencing data to the academic community and carries out in-house genomic projects
• Focus on biodiversity: de novo sequencing and metagenomics projects
• Coordination of large sequencing projects such as Tara Oceans

Marion Dubarry | Omics-driven genome annotation | 31st March 2017 | Lille


Genoscope Sequencers
Sequencing technology

• ABI 3730 XL
• 2 HiSeq 2500
• 2 HiSeq 4000
• 2 MiSeq
• 6 MinION MK1
• 1 PromethION
• 1 Bionano (Irys System)


Oxford Nanopore Technology
Sequencing technology

• USB plugin
• No DNA synthesis
• Error rates around 10%
• Low-cost device: $1000

nanoporetech.com


Fast evolving technology
ONT correction using NaS

Quality improvement, on average:
• 33% errors in 1D with R7.3; 14% with R9.4
• 18% errors in 2D with R7.3; 9% with R9.4


Correction using NaS
Oxford Nanopore Technology

NaS is a hybrid approach that takes advantage of data produced by the MinION device

https://github.com/institut-de-genomique/NaS


ONT correction using NaS

NaS is based on micro-assemblies to produce near-perfect reads

github.com/institut-de-genomique/NaS


Lymnaea stagnalis

STAGING: Lymnaea STAGnalis INternational Genome initiative

• Hermaphrodite invertebrate species
• A multi-disciplinary consortium
• Need for a high-quality reference genome

Copyrights L. stagnalis pictures: E. de Roij and C. Levesque


Genome sequencing
Lymnaea stagnalis

Sequencer    Fragment size   Read size   Coverage
HiSeq 2000   3-15 Kbp        101         135
HiSeq 2500   600 bp          250         39
MiSeq        400-700 bp      300         12


Genome assembly
Lymnaea stagnalis

Assembly with SPAdes (SSPACE + GapCloser)

Metrics         Scaffold (bp)
Assembly size   942,996,421
Number          6,640
Max length      7,106,496
N50             957,215
L50             297
%N              7.33
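
As a reminder of how the N50 and L50 metrics are usually defined, here is a minimal Python sketch. It is not the script used at the Genoscope; the function name and the toy scaffold lengths are made up for illustration. N50 is taken as the length of the scaffold at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the assembly size; L50 is the number of scaffolds needed to get there.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths in bp."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)   # longest scaffold first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50, 'count' is the L50
            return length, count


# Toy example (hypothetical lengths, not the L. stagnalis scaffolds)
toy_lengths = [100, 80, 60, 40, 20]
print(n50_l50(toy_lengths))  # (80, 2)
```

Read this way, the slide's values mean that the 297 longest scaffolds, each at least 957,215 bp long, cover half of the 943 Mbp assembly.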


Annotation pipeline at the Genoscope


Masking
Annotation pipeline

Annotated TEs were provided by URGI (INRA)


RNA mapping
Annotation pipeline

• Read assembly using Oases
• Filter 1
    • Best match (BLAT score) per contig
    • Percent identity ≥ 90%
    • Introns ≥ 100 kb are split
• Filter 2
    • Percent identity ≥ 90%
    • Length ratio of aligned contig ≥ 85%
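
The two filters can be sketched roughly as follows in Python. This is an illustrative sketch only, not the Genoscope pipeline code: the `Alignment` record, its field names and the order in which the criteria are applied are assumptions, and the splitting of alignments at introns ≥ 100 kb is left out.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one contig-to-genome alignment."""
    contig: str
    score: int          # alignment score (e.g. a BLAT score)
    identity: float     # percent identity
    aligned_len: int    # number of contig bases aligned
    contig_len: int     # full contig length


def best_per_contig(alignments):
    """Keep only the highest-scoring alignment of each contig."""
    best = {}
    for aln in alignments:
        current = best.get(aln.contig)
        if current is None or aln.score > current.score:
            best[aln.contig] = aln
    return list(best.values())


def filter_alignments(alignments, min_identity=90.0, min_ratio=0.85):
    """Apply the identity and aligned-length-ratio thresholds."""
    kept = []
    for aln in best_per_contig(alignments):
        ratio = aln.aligned_len / aln.contig_len
        if aln.identity >= min_identity and ratio >= min_ratio:
            kept.append(aln)
    return kept


# Made-up alignments for three contigs
examples = [
    Alignment("c1", 950, 99.0, 980, 1000),
    Alignment("c1", 400, 99.5, 420, 1000),   # lower score, discarded
    Alignment("c2", 700, 88.0, 900, 1000),   # identity below 90%
    Alignment("c3", 600, 97.0, 700, 1000),   # only 70% of contig aligned
]
kept_contigs = [aln.contig for aln in filter_alignments(examples)]
print(kept_contigs)  # ['c1']
```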


Nanopore and RNA mapping
Lymnaea stagnalis

RNAseq                             # of mapped contigs   % of mapped contigs   % identity   # exon/alignment   %GT/AG   %GC/AG
11 mat. Illumina TruSeq Stranded   489,344               96.4                  99.7         3.1                97.86    1.34
4 mat. Illumina Smarter            128,780               96                    99.5         4.24               96.44    1.68
Nanopore run 1                     105,718               58.9                  85.7         3.5                78.3     10
Nanopore run 2                     65,859                47.4                  85.4         3.7                80.1     9.9
NaS reads run 1                    121,626               98.6                  99.7         4.92               95.97    2.36
NaS reads run 2                    75,950                93.3                  99.6         4.98               95.95    2.40
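
The %GT/AG and %GC/AG columns give the share of introns with each donor/acceptor dinucleotide pair. A minimal Python sketch of how such percentages could be derived is given below. It assumes intron sequences have already been extracted in transcript orientation; the function name and the toy sequences are hypothetical and are not taken from the pipeline.

```python
from collections import Counter


def splice_site_percentages(introns):
    """Percentage of introns per donor/acceptor dinucleotide pair."""
    counts = Counter()
    for seq in introns:
        seq = seq.upper()
        if len(seq) < 4:
            continue  # too short to hold a donor and an acceptor site
        counts[f"{seq[:2]}/{seq[-2:]}"] += 1
    total = sum(counts.values())
    return {pair: 100 * n / total for pair, n in counts.items()}


# Toy intron sequences (invented for the example)
toy_introns = ["GTAAGTTTTCAG", "gtatgcccttag", "GCAAGTCCAG", "ATCCCCAC"]
percentages = splice_site_percentages(toy_introns)
print(percentages)  # {'GT/AG': 50.0, 'GC/AG': 25.0, 'AT/AC': 25.0}
```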


Nanopore mapping
Lymnaea stagnalis


Protein mapping
Annotation pipeline

• Gastropoda protein sequences (from UniProt)
• Protein sequences are masked (low complexity) using seg
• Filter 1
    • Best match (BLAT score) per contig
    • Matches with a score within 90% of the best-match (BM) score
    • Introns ≥ 100 kb are split
• Filter 2
    • Length ratio of aligned protein ≥ 50%
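
The "within 90% of the best-match score" rule and the 50% length ratio could be sketched as below. This is a sketch under stated assumptions rather than the actual implementation: matches are grouped per query protein, "within 90%" is read as score ≥ 0.9 × best score, and the dictionary keys are invented for the example. Intron splitting is again omitted.

```python
from collections import defaultdict


def filter_protein_matches(matches, score_fraction=0.90, min_ratio=0.50):
    """Keep near-best matches that align at least half of the protein."""
    by_protein = defaultdict(list)
    for match in matches:
        by_protein[match["protein"]].append(match)

    kept = []
    for group in by_protein.values():
        best_score = max(m["score"] for m in group)
        for m in group:
            near_best = m["score"] >= score_fraction * best_score
            long_enough = m["aligned_len"] / m["protein_len"] >= min_ratio
            if near_best and long_enough:
                kept.append(m)
    return kept


# Made-up matches for two proteins
toy_matches = [
    {"protein": "p1", "score": 500, "aligned_len": 300, "protein_len": 400},
    {"protein": "p1", "score": 460, "aligned_len": 200, "protein_len": 400},
    {"protein": "p1", "score": 300, "aligned_len": 350, "protein_len": 400},
    {"protein": "p2", "score": 200, "aligned_len": 100, "protein_len": 400},
]
kept_scores = [m["score"] for m in filter_protein_matches(toy_matches)]
print(kept_scores)  # [500, 460]
```

Keeping every near-best match, rather than only the single best one, lets a protein support several loci, which may matter for members of gene families.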


Protein mapping
Annotation pipeline

Protein mapping against UniProt

Mapping metrics
Number of sequences:            112,349
Number of matches:              77,070
Number of exons:                273,997
Number of exons/model:          3.56
Number of mapped sequences:     63,404 (56.4%)
Number of unmapped sequences:   48,945
Number of monoexonics:          41,739
Number of loci:                 15,704
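
The derived figures in these metrics can be recomputed from the raw counts. The short Python check below assumes that "exons/model" is the number of exons divided by the number of matches; the slide does not say so explicitly, but that reading reproduces the reported value.

```python
# Raw counts as reported on the slide
n_sequences = 112_349
n_matches = 77_070
n_exons = 273_997
n_mapped = 63_404

mapped_pct = 100 * n_mapped / n_sequences   # share of proteins mapped
n_unmapped = n_sequences - n_mapped         # proteins without a match
exons_per_model = n_exons / n_matches       # assumed definition

print(f"{mapped_pct:.1f}% mapped")                # 56.4% mapped
print(f"{n_unmapped:,} unmapped")                 # 48,945 unmapped
print(f"{exons_per_model:.2f} exons per model")   # 3.56 exons per model
```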


Gmove
Annotation pipeline

github.com/institut-de-genomique/gmove


Gmove
Benchmark with Augustus and Maker

Annotation of D. rerio (Chr. 3) with Augustus, Maker2 and Gmove

                       ref               Gmove             Augustus          Maker
Gene count             1293              1383              2059              1227
Gene without intron    104               394               495               322
Gene length            20618.63 : 9376   17300.56 : 6339   17327.97 : 8043   7984.99 : 4301
Exons / gene           8.27 : 6          6.65 : 4          5.76 : 3          5.75 : 3
CDS length             1469.01 : 1077    1189.75 : 873     1393.04 : 945     1117.13 : 798
Coding base coverage   3.0%              2.6%              4.6%              2.2%
Intron count           9397              7815              9798              5830
Intron length          2634.93 : 898     2851.09 : 1011    3348.65 : 1331    1445.43 : 680
(SN+SP)/2              -                 30                17                22
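
The (SN+SP)/2 score averages sensitivity and specificity. The slide does not state at which level (nucleotide, exon or gene) it was computed, so the Python sketch below only illustrates the usual gene-prediction convention at the exon level, where SN = TP / (TP + FN) and SP = TP / (TP + FP). The function name and coordinates are made up.

```python
def sn_sp_average(reference, predicted):
    """Return (SN, SP, (SN+SP)/2) for two collections of features.

    A predicted feature counts as a true positive only if it matches a
    reference feature exactly (same coordinates).
    """
    ref, pred = set(reference), set(predicted)
    true_pos = len(ref & pred)
    sn = true_pos / len(ref) if ref else 0.0    # TP / (TP + FN)
    sp = true_pos / len(pred) if pred else 0.0  # TP / (TP + FP)
    return sn, sp, (sn + sp) / 2


# Toy exon coordinates (start, end), invented for the example
ref_exons = [(1, 100), (200, 300), (400, 500), (600, 700)]
pred_exons = [(1, 100), (200, 300), (450, 500)]
sn, sp, score = sn_sp_average(ref_exons, pred_exons)
print(f"SN={sn:.2f} SP={sp:.2f} (SN+SP)/2={score:.2f}")
```

In this toy case two of the four reference exons are recovered (SN = 0.50) and two of the three predictions are correct (SP ≈ 0.67).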


Genome browser
Lymnaea stagnalis


Lymnaea stagnalis annotation

Gene prediction metrics            Lymnaea stagnalis   Lottia gigantea
Assembly size                      943 Mbp             359 Mbp
# predicted genes                  22,499              23,340
# predicted genes without intron   3,492 (15.5%)       5,111
# exons per gene (avg : med)       7.59 : 5            6.01 : 4
Gene size (nt) (avg : med)         15,299 : 8,502      5,325.63 : 2,970
CDS size (nt) (avg : med)          1,375 : 993         1,150.23 : 852
Intron size (nt) (avg : med)       1,611 : 604         778.12 : 367
Complete Single-Copy BUSCOs        97%

Lottia gigantea annotation from the JGI


Domain detection
Lymnaea stagnalis

Domains were detected using InterProScan

Domains
# predicted genes with domains        19,387 (86%)
# predicted genes with IPR            16,622 (74%)
# predicted genes with Pfam domains   14,872 (66%)


Conclusion
STAGING project

• The genome assembly is around 943 Mbp (N50 957 Kbp)
• 15 Illumina materials, 2 Nanopore materials and Gastropoda protein sequences from UniProt were used for the annotation
• Nanopore reads were corrected using NaS
• 22,499 genes were predicted using Gmove
• 86% of predicted genes contain at least one domain

Perspectives
• Waiting for direct RNA sequencing from ONT
• Gmove is a new combiner tool: it will take nanopore reads as input


Conclusion
Genome assembly and/or annotation at the Genoscope


R&D BioSeq team
www.genoscope.cns.fr/rdbioseq

• Genoscope
• Marie-Agnès Coutellec (INRA, Agrocampus-Ouest, UMR0985 Ecology and Ecosystem Health, Rennes)
• Funding agencies: CEA, Genoscope and France Génomique

[email protected]
