1 of 42

Browsing Genes and Genomes with Ensembl

Bert Overduin

Ensembl User Support

EMBL Outstation

European Bioinformatics Institute

Wellcome Trust Genome Campus

Hinxton, Cambridge, UK

2 of 42

Course Schedule

Introduction

Website walk-through

Coffee

Exercises

BioMart

Lunch

Exercises

GeneBuild

Tea

Variations / Compara

Exercises

3 of 42

Ensembl Workshops

4 of 42

EMBL-EBI

Hinxton, Cambridge

5 of 42

Wellcome Trust Genome Campus

Hinxton, Cambridge

© John Freebrey (www.thedigitaldarkcloth.com)

6 of 42

7 of 42

© Sean T. McHugh (www.cambridgeincolour.com)

Cambridge

8 of 42

A Bit of History

Sequenced genomes:

• 1995 Haemophilus influenzae 1.8 Mb
• 1996 Yeast 12 Mb
• 1998 C. elegans 100 Mb
• 1999 Fruit fly 125 Mb
• 2000 Arabidopsis 115 Mb
• 2001 Human (draft)
• 2002 Mouse 2.6 Gb
• 2004 Human (“finished”) 3 Gb

9 of 42

A Bit of History

http://www.genomesonline.org/

10 of 42

Annotation

Wikipedia: Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:

1. identifying elements on the genome, a process called Gene Finding, and
2. attaching biological information to these elements.

Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.

11 of 42

Ensembl - Goals

• Provide automatic annotation of genomic sequence

• Integrate other biological data

• Make data available to all via the web

12 of 42

Ensembl - Organisation

• Joint project between European Bioinformatics Institute (EMBL-EBI) and Wellcome Trust Sanger Institute

• Started in 1999 for the Human Genome Project

• Funded primarily by the Wellcome Trust, additional funding by EMBL, EU, NIH-NIAID, BBSRC and MRC

• Team of ca. 40 people, led by Ewan Birney (EBI) and Tim Hubbard (Sanger)

• Uses the largest dedicated computer system in biology in Europe

13 of 42

Genome Browsers

• Ensembl Genome Browser
  http://www.ensembl.org

• NCBI Map Viewer
  http://www.ncbi.nlm.nih.gov/mapview/

• UCSC Genome Browser
  http://genome.ucsc.edu

14 of 42

NCBI Map Viewer

15 of 42

UCSC Genome Browser

16 of 42

Ensembl Genome Browser

17 of 42

What Distinguishes Ensembl from the UCSC and NCBI Browsers?

• Automatic annotation for those species for which no manually curated gene set exists

• Direct database access and programmatic access via the Perl API

• Not only the data, but also the software source code is open source

18 of 42

Caveats

• While genome browsers can be very useful tools, they do not provide the definitive answer to every question!

• Data is fluid

19 of 42

Which Species Are Available?

• 36 chordates, ranging from mammals to ‘primitive’ chordates (Ciona intestinalis and Ciona savignyi)

• 3 key eukaryote model organisms:
  fruitfly (Drosophila melanogaster)
  nematode (Caenorhabditis elegans)
  yeast (Saccharomyces cerevisiae)

• 2 insect pathogen vectors:
  malaria mosquito (Anopheles gambiae)
  yellow fever / dengue mosquito (Aedes aegypti)

20 of 42

Species in Ensembl

[Figure: animal groups plotted against a geological time scale in million years before present (MYBP), from the Cambrian (570) to the Tertiary (65). Groups shown: non-vertebrates, agnathans, fishes (sharks, rays, teleosts, bichir/Polypterus, Latimeria, lungfishes), amphibians, reptiles (lizards, turtles, crocodiles), birds (paleognaths, passerines, other birds) and mammals (monotremes, marsupials, placentals).]

21 of 42

More Species to Come ….

• Oikopleura
• Gorilla
• Zebra finch
• Orangutan
• Marmoset
• Amphioxus
• Acorn worm
• Hyrax
• Megabat
• Dolphin
• Tarsier
• Kangaroo rat
• Chinese pangolin
• Two-toed sloth
• Llama
• Flying lemur

22 of 42

Which Data Are Available?

• Genomic sequence
• Gene/transcript/peptide models
• External references
• Mapped cDNAs, peptides, microarray probes, BAC clones etc.
• Other features of the genome: cytogenetic bands, markers, repeats etc.
• Comparative data: orthologues and paralogues, protein families, whole genome alignments, syntenic regions
• Variation data: SNPs
• Regulatory data: “best guess” set of regulatory elements
• Data from external sources (DAS)

23 of 42

Gene/Transcript/Peptide Models

• Manual annotation

  For parts of genomes: human, dog, mouse, zebrafish (“Vega genes”)

  For complete genomes: fruitfly (FlyBase), C. elegans (WormBase), yeast (SGD)

• Automatic predictions (“Ensembl genes”)

• EST predictions

• Ab initio predictions (GENSCAN, SNAP)

24 of 42

Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

• UniProt/Swiss-Prot
  A manually curated database and therefore of highest accuracy

• NCBI RefSeq
  A partially manually curated database

• UniProt/TrEMBL
  Automatically annotated translations of EMBL coding sequence (CDS) features

• EMBL / GenBank / DDBJ
  Primary nucleotide sequence repository

25 of 42

The Ensembl Genebuild

Genome assembly + Computer programs + Experimental evidence → Ensembl Genes

26 of 42

Ensembl Identifiers

• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• ENSF### Ensembl Family ID
• ENSR### Ensembl Regulatory Feature ID

• For species other than human, a species code is added after ENS:
  MUS for mouse (Mus musculus): ENSMUSG###,
  DAR for zebrafish (Danio rerio): ENSDARG### etc.

• For imported genes Ensembl uses the original identifiers
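The naming scheme above can be sketched as a small parser. This is only an illustration under stated assumptions: the function and table names are hypothetical, the species table holds just the two codes named on the slide, the number of digits is not fixed because the slide only writes "###", and the example identifiers are made up.

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "peptide",
    "E": "exon",
    "F": "family",
    "R": "regulatory feature",
}

# Only the codes named on the slide; an empty code means human.
# Any other code is accepted but left unresolved (species is None).
SPECIES_CODES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}

# ENS + optional species code + one feature-type letter + digits.
# Assumption: the digit count is left open, as the slide only shows "###".
STABLE_ID = re.compile(r"^ENS([A-Z]*)([GTPEFR])(\d+)$")


class StableId(NamedTuple):
    species_code: str
    species: Optional[str]
    feature_type: str
    number: int


def parse_stable_id(identifier: str) -> StableId:
    """Split an Ensembl-style identifier into species code, type and number."""
    match = STABLE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an Ensembl-style identifier: {identifier!r}")
    code, letter, digits = match.groups()
    return StableId(code, SPECIES_CODES.get(code), FEATURE_TYPES[letter], int(digits))


if __name__ == "__main__":
    # Made-up identifiers, used only to exercise the parser.
    for example in ("ENSG00000000001", "ENSMUSG00000000042", "ENSDART00000000007"):
        print(example, parse_stable_id(example))
```

Identifiers of imported genes keep their original form, so a parser like this would reject them with a ValueError rather than guess.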

27 of 42

Access to Genome AnnotationAccess to Genome Annotation

• Release web site http://www.ensembl.org/

• Pre-Release http://pre.ensembl.org/

• Archive http://archive.ensembl.org

• BioMart http://www.ensembl.org/Multi/martview

• Downloads ftp://ftp.ensembl.org/

• MySQL interface ensembldb.ensembl.org

• Perl API http://www.ensembl.org/info/software/

28 of 42

Pre! and Archive! Sites

29 of 42

BioMart Data Mining Tool

30 of 42

Downloads

ftp://ftp.ensembl.org/pub
http://www.ensembl.org/info/data/download.html

FASTA files: plain sequence
• DNA (assembly masked and unmasked)
• cDNA (Ensembl and ab initio predictions)
• Peptides (Ensembl and ab initio predictions)
• RNA (non-coding RNA predictions)

Flatfiles: annotated 1 Mb slices
• EMBL format
• GenBank format

MySQL: database table dumps
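As a minimal sketch of working with a downloaded FASTA file, the reader below yields one (header, sequence) pair per record. The function names are hypothetical, and the masking helper assumes that masked bases appear as N, which the slides do not state.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(sequence: str) -> float:
    """Fraction of bases written as N (assumed here to mark masked sequence)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


if __name__ == "__main__":
    # Tiny in-memory example; a real file would be opened with open(path).
    demo = io.StringIO(">seq1 demo record\nACGT\nNNAC\n>seq2\nGG\n")
    for name, seq in read_fasta(demo):
        print(name, len(seq), masked_fraction(seq))
```

Reading record by record keeps memory use proportional to the largest single sequence rather than the whole file.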

31 of 42

MySQL

SQL = Structured Query Language

Needed:

• MySQL client program
  http://www.mysql.com

• Ability to write MySQL queries

• Knowledge of database schema
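To make the requirements above concrete, here is a hedged sketch that only assembles a command line for the standard mysql client and prints it; it does not connect to anything. The host is the one named on the access slide. The user name "anonymous", the port 3306 and the database name pattern in the example query are assumptions not stated in the slides, and the helper name is hypothetical.

```python
import shlex
from typing import List, Optional

ENSEMBL_HOST = "ensembldb.ensembl.org"  # host named on the access slide


def mysql_command(
    query: str,
    user: str = "anonymous",  # assumption: read-only public account
    port: int = 3306,         # assumption: default MySQL port
    database: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a one-off query with the mysql client."""
    command = [
        "mysql",
        f"--host={ENSEMBL_HOST}",
        f"--user={user}",
        f"--port={port}",
        f"--execute={query}",
    ]
    if database:
        command.append(database)  # the client takes the database name last
    return command


if __name__ == "__main__":
    # Assumed naming pattern; list matching databases before querying one.
    query = "SHOW DATABASES LIKE 'homo_sapiens_core%'"
    print(shlex.join(mysql_command(query)))
```

Writing real queries still requires knowledge of the database schema, which this sketch deliberately leaves out.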

32 of 42

Perl API

API = Application Programming Interface

Needed:

• BioPerl modules
• Ensembl modules
• Ability to code in Perl

For more information (installation instructions, tutorials, documentation etc.):
http://www.ensembl.org/info/software/index.html

33 of 42

Ensembl BLAST

WU-BLAST 2.0:
• search against assemblies, Ensembl predictions or ab initio predictions

BLAT and SSAHA2:
• BLAST-like Alignment Tool
• Sequence Search and Alignment by Hashing Algorithm
• very fast
• search against assemblies for (almost) exact DNA-DNA matches

Search against one or multiple species
Search max. 30 sequences simultaneously
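The idea behind hashing-based search can be illustrated with a toy k-mer index. This is not the BLAT or SSAHA2 implementation, only a sketch of why a hash lookup makes exact matching fast; the function names, the word length k and the sequences are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(subject: str, k: int) -> Dict[str, List[int]]:
    """Map every length-k word in the subject to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def exact_hits(query: str, subject: str, k: int = 4) -> List[int]:
    """Return start positions in the subject where the query matches exactly."""
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    index = build_kmer_index(subject, k)
    hits = []
    # Look up the first k bases in the hash table, then verify the rest.
    for pos in index.get(query[:k], []):
        if subject[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    subject = "ACGTACGTTAGC"  # made-up sequence
    print(exact_hits("ACGTT", subject))
```

The index is built once per subject, so each query only costs a dictionary lookup plus verification of the candidate positions. Real tools add tolerance for mismatches and gaps, which this sketch does not attempt.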

34 of 42

Ensembl Accounts

• Personalise Ensembl by saving bookmarks, view configurations and homepage preferences in a user account

• Share bookmarks and configurations by setting up groups

Please note that all Ensembl data remain freely accessible. It is not necessary to register in order to gain access to Ensembl data!

35 of 42

Website Statistics

On average 1,000,000 page impressions / week

Top 3 species:

Top 3 countries:

36 of 42

Ensembl – Open Source

• Data and software freely available

• More than 50 installs worldwide

• Academia and industry

• Local or available via the web

• Mirrors with Ensembl data, e.g. http://ensembl.genome.tugraz.at/index.html or user projects with own data

37 of 42

Powered by Ensembl

38 of 42

What If I Need Help?

• Helpdesk:

helpdesk@ensembl.org

• Workshops on use of the browser or the API

• Mailing lists:

ensembl-dev@ebi.ac.uk ensembl-announce@ebi.ac.uk

• ‘Geek for a week’ program

• Animated tutorials

http://www.ensembl.org/common/Workshops_Online

39 of 42

Ensembl Team

Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard

Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios

Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)

Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Dace Ruklisa, Daniel Zerbino

Vectorbase Annotation: Martin Hammond, Dan Lawson, Karyn Megy

Zebrafish Annotation: Kerstin Howe, Tina Eyre, Ian Sealy

Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Felix Kokocinski, Jan-Hinnerck Vogel, Simon White

Comparative Genomics: Javier Herrero, Benoit Ballester, Kathryn Beal, Stephen Fitzgerald, Albert Vilella

Web Team: James Smith, Fiona Cunningham, Anne Parker, Bethan Pritchard, Stephen Rice, Steve Trevanion

Outreach & QC: Xosé M Fernández, Bert Overduin, Michael Schuster, Giulietta Spudich

Distributed Annotation System (DAS): Eugene Kulesha, Andy Jenkinson

BioMart: Arek Kasprzyk, Syed Haider, Richard Holland, Damian Smedley

Database Schema and Core API: Glenn Proctor, Andreas Kähäri, Ian Longden, Patrick Meidl

40 of 42

Ensembl Team on the river Cam, 2006

41 of 42

Ewan Birney

42 of 42

Q&A

Questions

Answers
