The Ensembl Gene set The “Genebuild” 21 April 2008

Outline

The GeneBuild (determining the Ensembl gene set)

What does it mean for the scientist? ‘Annotation pipeline’ vs ‘manual curation’

Pseudogenes

ncRNAs

The CCDS project

Introduction

What is available?

I) Sequence assemblies from genome sequencing efforts

Gene Sequencing: the Assembly

http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html

This approach generates clones, in contrast to newer sequencing methods.

Clones Available

H: tilepath clones (used in the assembly)

Ciona intestinalis: shotgun assembly

ContigView: Clones and Contigs

[Figure: ContigView display showing contigs, clones (plate/well numbers) and Ensembl transcripts]

Task:

View the tilepath clone in ContigView for the region containing the human BRCA2 gene.

Hint: Start with a search for the BRCA2 gene.

The Ensembl Geneset

How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome?

[Figure: protein evidence and the sequence assembly combine to give the Ensembl geneset]

Once the Assembly is Imported…

Proteins and mRNAs are aligned to the assembly.

These sequences have been submitted to databases such as:

UniProt/Swiss-Prot (manually curated) and

RefSeq (partially manually curated)

The Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

UniProt/Swiss-Prot: a manually curated database and therefore of the highest accuracy

NCBI RefSeq: a partially manually curated database

UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features

EMBL / GenBank / DDBJ: primary nucleotide sequence repositories

Database Relationship

[Figure: relationships between individual labs' submissions, EMBL-Bank / GenBank / DDBJ, NCBI RefSeq, and UniProt (Swiss-Prot and TrEMBL)]

Genebuild

[Figure: Genebuild overview. Labels: sequence (assembly); proteins (e.g. Swiss-Prot); mRNA; EST; EMBL-Bank / GenBank / DDBJ; manual annotation (HAVANA); EST genes; Ensembl]

Why do I want to know?…

What is an Ensembl gene based on?

Ensembl genes may be based on multiple proteins/mRNAs.

Task

Look at the evidence for the human EPO gene.

What was this gene based on?

Hint: Go to Exon Information from the GeneView page

EPO gene supporting evidence

Species-Specific GeneBuilds

Pan troglodytes genes are built by projection from human genes.

Zebrafish has many gene duplications.

Homo sapiens genes must have protein evidence, not just mRNA.

Task

When was the chimpanzee (Pan troglodytes) Genebuild performed?

Can you find information as to how genes were annotated?

Hint: Look on the chimpanzee index page

External Gene Set: VEGA/Havana

Human, zebrafish, mouse and dog

Havana transcripts in blue or gold…

What are Havana transcripts?

Havana and Ensembl match

When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the two transcripts are merged and coloured gold.
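
The merge rule can be illustrated with a minimal sketch. It assumes each transcript is represented as a strand plus a set of exon (start, end) coordinates; the `Transcript` class, the `label_transcripts` function and the coordinates are hypothetical and are not Ensembl's actual data model or pipeline code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model (not the Ensembl schema)."""
    source: str                          # "havana" or "ensembl"
    strand: int                          # +1 or -1
    exons: Tuple[Tuple[int, int], ...]   # (start, end) pairs, inclusive

    def structure(self):
        # Two transcripts match "base pair for base pair" only if the
        # strand and every exon boundary are identical.
        return (self.strand, tuple(sorted(self.exons)))


def label_transcripts(havana, ensembl):
    """Return (structure, label) pairs; identical structures are merged."""
    havana_structs = {t.structure() for t in havana}
    ensembl_structs = {t.structure() for t in ensembl}
    labels = []
    for s in sorted(havana_structs | ensembl_structs):
        if s in havana_structs and s in ensembl_structs:
            labels.append((s, "merged (gold)"))
        elif s in havana_structs:
            labels.append((s, "havana only"))
        else:
            labels.append((s, "ensembl only"))
    return labels


# Made-up coordinates: the second Ensembl model differs by 10 bp at one
# exon end, so it is not merged.
havana = [Transcript("havana", 1, ((100, 200), (300, 400)))]
ensembl = [
    Transcript("ensembl", 1, ((100, 200), (300, 400))),
    Transcript("ensembl", 1, ((100, 200), (300, 410))),
]
for structure, label in label_transcripts(havana, ensembl):
    print(structure, label)
```

Comparing whole exon structures, rather than only transcript start and end, reflects the "base pair for base pair" requirement: a single differing splice site is enough to keep two models separate.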

Manually-curated gene sets in Ensembl

Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris

WormBase: Caenorhabditis elegans

FlyBase: Drosophila melanogaster

SGD: Saccharomyces cerevisiae

What Can Go Wrong?

I) A gap in the assembly

Gene might not be found in Ensembl

II) Fused genes

Gene might be associated with two names

[Figure: BLAST hit (Swiss-Prot entry)]

Outline

The genome sequence

The Genebuild

‘Manual curation’ by Havana

Other:

EST gene set

Pseudogenes

ncRNAs

Expressed Sequence Tags vs ‘cDNA’

ESTs are annotated separately. Why?

mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.

EST: lower quality sequence.

‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence.

ESTs are only 500-800 nucleotides long.

Low quality fragment: sequence error of ~2%.

BUT ESTs confer useful expression information:

Discovery of new genes, especially in diseased organisms

Tissue type

Timing/developmental stage

Sampling of more transcripts and variants
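
To put the EST figures in perspective, the short calculation below estimates how many erroneous bases a single read of 500-800 nucleotides would carry at an error rate of about 2%. It assumes, purely for illustration, that errors are independent and evenly spread along the read; the function names are hypothetical.

```python
def expected_errors(read_length: int, error_rate: float = 0.02) -> float:
    """Expected number of miscalled bases in one read."""
    return read_length * error_rate


def prob_error_free(read_length: int, error_rate: float = 0.02) -> float:
    """Probability that a read contains no error at all, assuming
    independent errors at every position (an illustrative assumption)."""
    return (1.0 - error_rate) ** read_length


for length in (500, 800):
    print(
        f"{length} nt: about {expected_errors(length):.0f} expected errors, "
        f"P(error-free) = {prob_error_free(length):.1e}"
    )
```

Under these assumptions a typical EST carries roughly 10 to 16 erroneous bases and is almost never error-free, which is one way to see why ESTs are kept apart from the higher quality mRNA and cDNA evidence.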

Where Can I See This EST Geneset? ContigView

Choose EST genes

EST track

Pseudogenes: ‘False’ Genes

Unprocessed: produced by gene duplication and rearrangement

Processed: produced by reverse transcription of mRNA and re-integration into the genome (the pseudogene retains the poly(A) tail, AAAAAA)

ncRNAs (non-coding RNAs)

What types are in Ensembl?

tRNA (transfer RNA)

rRNA (ribosomal RNA)

scRNA (small cytoplasmic)

snRNA (small nuclear)

snoRNA (small nucleolar)

miRNA (microRNA)

ncRNAs (2 types)

I) RNA with low sequence homology can be identified through conserved secondary structure (search the genome using an Rfam pattern)

II) High sequence conservation (miRNA)

BLAST alignment

‘RNA fold’ applied to make sure sequences can fold (hairpin)
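
As a rough illustration of the "can it fold into a hairpin" check, the sketch below tests whether the 5' end of an RNA sequence can base-pair with its 3' end around a central loop. This is a deliberately naive stand-in, not the folding program used in the Ensembl pipeline: it ignores thermodynamics, G-U wobble pairs, bulges and mismatches, and the function name and default parameters are assumptions chosen for the example.

```python
# Watson-Crick pairs only; real folding tools also score G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def can_form_hairpin(seq: str, min_stem: int = 4, min_loop: int = 3) -> bool:
    """Return True if the two ends of `seq` form a perfect stem of at
    least `min_stem` base pairs while leaving at least `min_loop`
    unpaired bases in the middle."""
    seq = seq.upper().replace("T", "U")   # accept DNA-style input
    stem = 0
    i, j = 0, len(seq) - 1
    # Pair outermost bases inwards while enough bases remain for the loop.
    while j - i - 1 >= min_loop and (seq[i], seq[j]) in PAIRS:
        stem += 1
        i += 1
        j -= 1
    return stem >= min_stem


print(can_form_hairpin("GGGGAAAACCCC"))   # stem of 4 G-C pairs, loop AAAA
print(can_form_hairpin("AAAAAAAAAAAA"))   # no complementary ends
```

A candidate that passes the BLAST step but fails a fold check like this would be a poor miRNA precursor candidate; a real pipeline would rely on an energy-based folding tool rather than exact end complementarity.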

ncRNAs… where can I see them?

Find them in ContigView, or use BioMart.

Summary – Ensembl Genes

All Ensembl genes are based on biological evidence (protein and mRNA).

One Ensembl gene may come from proteins and mRNAs in various databases.

Havana (manually curated) genes are incorporated into the Ensembl geneset, merged for human.

The CCDS set strives for consensus coding sequences across databases.

Pseudogenes and RNAs are annotated, along with a separate EST gene set.

For more on GeneBuild:

Help and Documentation

(About Ensembl)

http://www.ensembl.org/info/about/docs/genome_annotation.html