EBI is an Outstation of the European Molecular Biology Laboratory. Ensembl annotation Bronwen Aken 21 September 2014

Ensembl annotation


DESCRIPTION

GRC Workshop held at Churchill College on Sep 21, 2014. Talk by Bronwen Aken discussing the Ensembl approach to annotating the complete human reference assembly.


Page 1: Ensembl annotation

EBI is an Outstation of the European Molecular Biology Laboratory.

Ensembl annotation

Bronwen Aken

21 September 2014

Page 2: Ensembl annotation

How Ensembl started

• Ewan Birney

• Michele Clamp

• Tim Hubbard

Page 3: Ensembl annotation

Ensembl’s goals

Annotate (vertebrate) genome

Integrate with other biological data

Make publicly available

• Stable, automatic annotation

• High quality

• Regular release cycles

• Open source

“Provide a bioinformatics framework to organise biology around the sequences of large genomes”

Page 4: Ensembl annotation

Challenges

1. Find functional elements in a genome

• Data have lots of noise

2. Software / hardware

• Storing and manipulating data

3. Intuitive and comprehensive access to data

• Visualization

Page 5: Ensembl annotation

GRCh38 annotation in Ensembl

Page 6: Ensembl annotation

What is Genebuilding?

• Automatic, evidence-based annotation of genes

• Not ab initio

• Based on sequence alignment

• “Best-in-genome”

• Aim for high specificity

• Prefer to miss a few features rather than heavily over-predict

Automated gene annotation pipeline is designed around decisions made during manual annotation

Page 7: Ensembl annotation

Advantages of re-annotating

• Add new genes to new / fixed genomic regions

• Updated supporting evidence: remove models built on data that have been deleted from archives

• Move alignments to regions with better mapping

Page 8: Ensembl annotation

Gene annotation pipeline – the basics

Identify interesting regions

• Rough alignment of sequences to genome

Exhaustive alignment to produce transcript models

Filter models

• Prioritize data sources

Produce ‘best guess’ gene set
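The "filter models / prioritize data sources" step can be pictured as a layered filter: models from a trusted source are kept first, and models from a less trusted source are kept only where no higher-priority model already covers the locus. The sketch below is illustrative only. The source names, their order, and the `Model` class are assumptions made for this example, not the pipeline's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    source: str   # which evidence type produced the model
    start: int    # genomic start (inclusive)
    end: int      # genomic end (inclusive)


# Hypothetical priority order: earlier entries are trusted more.
SOURCE_PRIORITY = ["same_species_protein", "cdna", "other_species_protein"]


def overlaps(a: Model, b: Model) -> bool:
    """True if the two models share at least one base."""
    return a.start <= b.end and b.start <= a.end


def layer_filter(models, priority=SOURCE_PRIORITY):
    """Keep a model only if no model kept from a higher layer overlaps it.

    Models whose source is not listed in `priority` are ignored.
    """
    kept = []
    for source in priority:
        higher = list(kept)  # snapshot: only higher layers can block a model
        for model in models:
            if model.source != source:
                continue
            if not any(overlaps(model, k) for k in higher):
                kept.append(model)
    return kept


models = [
    Model("A", "same_species_protein", 100, 500),
    Model("B", "cdna", 400, 900),                    # overlaps A, so dropped
    Model("C", "other_species_protein", 1000, 2000),
    Model("D", "cdna", 600, 800),
]
best_guess = [m.name for m in layer_filter(models)]
```

Models within the same layer do not block each other here; a real pipeline would need a further step to choose between them.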

Page 9: Ensembl annotation

Protein-coding genebuild

[Flowchart. Boxes: Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; Filtering; UTR addition; Filtering; Final gene set. Labelled modules: TranscriptConsensus, LayerAnnotation.]

Also:

• Small ncRNAs

• LincRNAs

• Pseudogenes

Page 10: Ensembl annotation

Protein-coding genebuild

[Flowchart. Boxes: Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; RNA-Seq models; Filtering; UTR addition; Filtering; Final gene set.]

Also:

• Small ncRNAs

• LincRNAs

• Pseudogenes

MERGE WITH HAVANA

Page 11: Ensembl annotation

Release cycle

26 September 2014

[Figure adapted from the ENCODE project, www.nature.com/nature/focus/encode/. Labels: Regulation, Gene, Allele, Conserved sequence.]

Genes

• Coding & noncoding

• Protein & mRNA alignments

• GTF & BAM files

Compara

• Conserved DNA sequence

• Multiple genome alignments

• Homologues

• Protein families

Regulatory regions

• DNA methylation

• TFBS

• Open chromatin

Variation

• SNPs, indels, structural variation

• Phenotypes

• QTLs

Page 12: Ensembl annotation

Integrate with other species

[Figure: gene SLC12A1 shown for Chimpanzee and Human]

Page 13: Ensembl annotation

‘Patch’ annotation in Ensembl

Page 14: Ensembl annotation

Genome assembly representation

• Coord_system table

• Lists the allowed coordinate systems

• chromosome, scaffold, contig

• With ‘versions’

• GRCh37, GRCh38

• Contigs are shared between assemblies so have no version

• ‘Toplevel’ coordinate system

• Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences

• Most popular means to access the whole genome

• API options for including/excluding alternate sequences and PAR

Page 15: Ensembl annotation

Genome assembly representation

GRCh38

Scaffolds

Contigs

Chromosome

DNA only loaded for contigs

Page 16: Ensembl annotation

Genome assembly representation

GRCh38

Scaffolds

Contigs

Chromosome

DNA only loaded for contigs

Page 17: Ensembl annotation

Genome assembly representation

GRCh38

Scaffolds

Contigs

Chromosome

Page 18: Ensembl annotation

Genome assembly representation

GRCh38

Scaffolds

Contigs

Chromosome

GRCh37

Page 19: Ensembl annotation

Genome assembly representation

GRCh38

Scaffolds

Contigs

Chromosome

GRCh37

Page 20: Ensembl annotation

Seq_region names

• Regions of the genome are given a slice name; it’s like an address

• e.g. chromosome:GRCh37:6:133090509:133119701:1

• Users like to say, ‘chromosome 6’

• INSDC coordinates are versioned, but less human-readable

• chromosome:GRCh37:CM000668.1:133090509:133119701:1

Fields, in order: coord_system : assembly : seq_region.name : start : end : strand
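As an illustration of the slice-name layout, here is a minimal Python sketch that splits a name into its six colon-separated fields. The `SliceName` class and `parse_slice_name` function are hypothetical helpers written for this example and are not part of the Ensembl API.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system: str     # e.g. "chromosome"
    assembly: str         # coordinate system version, e.g. "GRCh37"
    seq_region_name: str  # e.g. "6" or an INSDC accession such as "CM000668.1"
    start: int
    end: int
    strand: int           # 1 for forward, -1 for reverse

    @property
    def length(self) -> int:
        # Coordinates are treated as 1-based and inclusive.
        return self.end - self.start + 1


def parse_slice_name(name: str) -> SliceName:
    """Split 'coord_system:assembly:seq_region:start:end:strand' into fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}: {name!r}")
    coord_system, assembly, seq_region, start, end, strand = parts
    return SliceName(coord_system, assembly, seq_region, int(start), int(end), int(strand))


friendly = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
insdc = parse_slice_name("chromosome:GRCh37:CM000668.1:133090509:133119701:1")
```

Both names from the slide parse to the same start, end and strand; only the `seq_region_name` field differs ("6" versus "CM000668.1").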

Page 21: Ensembl annotation

Alternate sequences

• Assembly_exception table defines ‘bubbles’

• Initially set up to handle Y chromosome PAR

• Adapted to work for MHC haplotypes

• Now also used for GRC patches

• Assumes ‘equivalent’ region will be present in primary assembly
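To make the ‘bubble’ idea concrete, the sketch below models one assembly-exception row as a pairing between a stretch of an alternate sequence and the equivalent stretch of the primary assembly. The field names follow my recollection of the Ensembl core schema and should be checked against the schema documentation. The coordinates are invented for illustration, and the two helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AssemblyException:
    seq_region: str            # alternate sequence (PAR copy, haplotype or patch)
    seq_region_start: int
    seq_region_end: int
    exc_type: str              # e.g. "PAR", "HAP", "PATCH_FIX", "PATCH_NOVEL"
    exc_seq_region: str        # equivalent region on the primary assembly
    exc_seq_region_start: int
    exc_seq_region_end: int


def equivalent_primary_region(
    exceptions: List[AssemblyException], alt_name: str
) -> Optional[Tuple[str, int, int]]:
    """Return (name, start, end) of the primary region an alternate replaces."""
    for e in exceptions:
        if e.seq_region == alt_name:
            return (e.exc_seq_region, e.exc_seq_region_start, e.exc_seq_region_end)
    return None


def alternates_overlapping(
    exceptions: List[AssemblyException], primary: str, start: int, end: int
) -> List[str]:
    """Names of alternate sequences whose bubble overlaps a primary range."""
    return [
        e.seq_region
        for e in exceptions
        if e.exc_seq_region == primary
        and e.exc_seq_region_start <= end
        and start <= e.exc_seq_region_end
    ]


# Invented coordinates, for illustration only.
EXCEPTIONS = [
    AssemblyException("HG183_PATCH", 62_200_000, 62_530_000, "PATCH_FIX",
                      "17", 62_200_000, 62_480_000),
]
primary_region = equivalent_primary_region(EXCEPTIONS, "HG183_PATCH")
alts_here = alternates_overlapping(EXCEPTIONS, "17", 62_300_000, 62_310_000)
```

The second helper is one simple way to answer "is an alternate sequence available here?" while a user browses the primary assembly.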

Page 22: Ensembl annotation

Gene annotation on a ‘patched’ genome

[Figure: two Ensembl browser views, patch HG183_PATCH (axis 62.3Mb to 62.5Mb) above the corresponding region of chromosome 17 on the primary assembly (axis 62.225Mb to 62.475Mb). Scale labels: 331.04 kb and 276.06 kb. Tracks include genes (GENCODE), contigs, assembly exceptions and an H.sap-H.sap lastz alignment.]

Genes shown on the patch: SNORA76, SNORD104, MILR1, Y_RNA, hsa-mir-1273e, AC234063.1, TEX2, AC016489.1, PECAM1

Genes shown on chromosome 17: SNORA76, SNORD104, AC138744.2, MILR1, TEX2, RPL31P57, POLG2

Gene legend: protein coding; merged Ensembl/Havana; RNA gene; pseudogene; alternative alleles; projection

Callouts:

• Patched chromosome

• Primary chromosome

• TEX2 gene lies across the patch boundary

• PECAM1 is annotated only on patch HG183

• Gap in primary assembly

Page 23: Ensembl annotation

Gene annotation on a ‘patched’ genome

Page 24: Ensembl annotation

Gene annotation on patches

Patch

Primary

Page 25: Ensembl annotation

Gene annotation on patches

Patch

Primary

1. Manual annotation

Page 26: Ensembl annotation

Gene annotation on patches

Patch

Primary

Patch

Primary

1. Manual annotation

2. Project models to patch

Page 27: Ensembl annotation

Gene annotation on patches

Patch

Primary

Patch

Primary

Patch

Primary

1. Manual annotation

2. Project models to patch

3. Gap-fill with mini genebuild
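Step 2 (projection) can be sketched as a coordinate mapping through aligned blocks shared by the primary assembly and the patch. This toy version assumes ungapped, same-strand blocks and projects a model only if every exon falls entirely inside one block. Anything that fails is left for manual review or for the gap-fill genebuild of step 3. The `Block` type and both functions are hypothetical and are not the Ensembl implementation.

```python
from typing import List, NamedTuple, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


class Block(NamedTuple):
    primary_start: int
    primary_end: int
    patch_start: int  # patch coordinate that corresponds to primary_start


def project_interval(start: int, end: int, blocks: List[Block]) -> Optional[Interval]:
    """Map a primary interval to the patch if one block contains it entirely."""
    for b in blocks:
        if b.primary_start <= start and end <= b.primary_end:
            offset = b.patch_start - b.primary_start
            return (start + offset, end + offset)
    return None


def project_model(exons: List[Interval], blocks: List[Block]) -> Optional[List[Interval]]:
    """Project every exon; return None if any exon cannot be projected."""
    projected = [project_interval(s, e, blocks) for s, e in exons]
    if any(p is None for p in projected):
        return None  # candidate for gap-filling on the patch
    return projected


blocks = [Block(100, 500, 1100), Block(600, 900, 2000)]
ok = project_model([(150, 200), (650, 700)], blocks)
failed = project_model([(150, 200), (520, 560)], blocks)
```

Here `ok` holds the exon coordinates shifted onto the patch, while `failed` is `None` because the second exon lies in sequence that has no aligned equivalent.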

Page 28: Ensembl annotation

Ongoing challenges

• How strict should we be when aligning proteins and cDNAs to the genome?

1. Genome assembly

• Sequencing error (inversion, artificial duplication)

• Assembly incomplete

• Alignments must allow for truncated matches

2. Population variation

• Linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals

• Paralogues, gene families, pseudogenes

3. Public databases, e.g. UniProt

• Include suspect data and are incomplete for many species

• When there’s a match, or no match, is it biologically real?

• Aligning proteins from other species must allow for mismatches

Specificity vs. sensitivity
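One way to picture the strictness question is as a pair of cut-offs on each alignment: a minimum percent identity and a minimum fraction of the query covered. Raising them favours specificity; lowering them favours sensitivity. Cross-species alignments need looser identity cut-offs, and an incomplete assembly argues for tolerating truncated matches. The thresholds and names below are invented for illustration and are not values used by Ensembl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query: str
    same_species: bool
    percent_identity: float  # 0 to 100
    query_coverage: float    # 0 to 100, share of the query that aligned


# Hypothetical (min identity, min coverage) per evidence class.
THRESHOLDS = {
    True: (97.0, 90.0),   # same species: expect near-exact matches
    False: (60.0, 80.0),  # other species: must allow for mismatches
}


def passes(aln: Alignment, thresholds=THRESHOLDS) -> bool:
    """True if the alignment meets both cut-offs for its evidence class."""
    min_identity, min_coverage = thresholds[aln.same_species]
    return aln.percent_identity >= min_identity and aln.query_coverage >= min_coverage


alignments = [
    Alignment("human_protein_1", True, 99.5, 100.0),
    Alignment("human_protein_2", True, 92.0, 100.0),  # possible paralogue
    Alignment("mouse_protein_1", False, 75.0, 85.0),
    Alignment("mouse_protein_2", False, 75.0, 40.0),  # truncated match
]
accepted = [a.query for a in alignments if passes(a)]
```

With these cut-offs the truncated mouse match is rejected even though it may reflect a gap in the assembly rather than a false hit, which is exactly the trade-off the slide describes.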

Page 29: Ensembl annotation

Ensembl Acknowledgements

Funding: European Commission Framework Programme 7

Page 30: Ensembl annotation

Questions?

Page 31: Ensembl annotation

Reporting data to users

Visualisation and data querying:

• When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available?

• How do we show when the alternate genomic sequences are identical or differ from one another?

• How do we show whether the alternate genome sequences result in identical or different transcribed / translated products?

• How do we make a qualitative call about which allele is “better” to use? e.g. ABO

• Data download options

• Concept of a ‘canonical’ transcript per gene (per tissue)

Data analysis:

• Linking between alternate alleles (and paralogues?)

• How do we show when data have been mapped from an old to new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align?

• In a non-linear genome model, how will SNPs (rsIDs) work?

• In a non-linear genome model, what coordinate system should be used?