1 of 42
Browsing Genes and Genomes with Ensembl
Bert Overduin
Ensembl User Support
EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, UK
2 of 42
Course Schedule
Introduction
Website walk-through
Coffee
Exercises
BioMart
Lunch
Exercises
GeneBuild
Tea
Variations / Compara
Exercises
3 of 42
Ensembl Workshops
4 of 42
EMBL-EBI
Hinxton, Cambridge
5 of 42
Wellcome Trust Genome Campus
Hinxton, Cambridge
© John Freebrey (www.thedigitaldarkcloth.com)
6 of 42
7 of 42
Cambridge
© Sean T. McHugh (www.cambridgeincolour.com)
8 of 42
A Bit of History
Sequenced genomes:
• 1995 Haemophilus influenzae 1.8 Mb
• 1996 Yeast 12 Mb
• 1998 C. elegans 100 Mb
• 1999 Fruit fly 125 Mb
• 2000 Arabidopsis 115 Mb
• 2001 Human (draft)
• 2002 Mouse 2.6 Gb
• 2004 Human (“finished”) 3 Gb
9 of 42
A Bit of History
http://www.genomesonline.org/
10 of 42
Annotation
Wikipedia: Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called Gene Finding, and
2. attaching biological information to these elements.
Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.
11 of 42
Ensembl - Goals
• Provide automatic annotation of genomic sequence
• Integrate other biological data
• Make data available to all via the web
12 of 42
Ensembl - Organisation
• Joint project between European Bioinformatics Institute (EMBL-EBI) and Wellcome Trust Sanger Institute
• Started in 1999 for the Human Genome Project
• Funded primarily by the Wellcome Trust, additional funding by EMBL, EU, NIH-NIAID, BBSRC and MRC
• Team of ca. 40 people, led by Ewan Birney (EBI) and Tim Hubbard (Sanger)
• Uses the largest dedicated computer system in biology in Europe
13 of 42
Genome Browsers
• Ensembl Genome Browser http://www.ensembl.org
• NCBI Map Viewer http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser http://genome.ucsc.edu
14 of 42
NCBI Map Viewer
15 of 42
UCSC Genome Browser
16 of 42
Ensembl Genome Browser
17 of 42
What Distinguishes Ensembl from the UCSC and NCBI Browsers?
• Automatic annotation for those species for which no manually curated gene set exists
• Direct database access and programmatic access via the Perl API
• Not only the data, but also the software source code is open source
18 of 42
Caveats
• While genome browsers can be very useful tools, they do not provide the definitive answer to every question!
• Data is fluid
19 of 42
Which Species Are Available?
• 36 chordates, ranging from mammals to ‘primitive’ chordates (Ciona intestinalis and Ciona savignyi)
• 3 key eukaryote model organisms:
  fruitfly (Drosophila melanogaster)
  nematode (Caenorhabditis elegans)
  yeast (Saccharomyces cerevisiae)
• 2 insect pathogen vectors:
  malaria mosquito (Anopheles gambiae)
  yellow fever / dengue mosquito (Aedes aegypti)
20 of 42
Species in Ensembl
[Figure: phylogenetic tree of the species groups in Ensembl, drawn against a geological time axis from the Cambrian to the Tertiary, with ticks at 570, 505, 438, 408, 360, 286, 245, 208, 144 and 65 MY BP. Labels: fishes, birds, reptiles, mammals, placentals, monotremes, marsupials, other birds, paleognaths, passerines, crocodiles, turtles, lizards, amphibians, teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes, agnathans, non-vertebrates.]
21 of 42
More Species to Come ...
Oikopleura, Gorilla, Zebrafinch, Orangutan, Marmoset, Amphioxus, Acorn worm, Hyrax,
Megabat, Dolphin, Tarsier, Kangaroo rat, Chinese pangolin, Two-toed sloth, Llama, Flying lemur
22 of 42
Which Data Are Available?
• Genomic sequence
• Gene/transcript/peptide models
• External references
• Mapped cDNAs, peptides, microarray probes, BAC clones etc.
• Other features of the genome: cytogenetic bands, markers, repeats etc.
• Comparative data: orthologues and paralogues, protein families, whole genome alignments, syntenic regions
• Variation data: SNPs
• Regulatory data: “best guess” set of regulatory elements
• Data from external sources (DAS)
23 of 42
Gene/Transcript/Peptide Models
• Manual annotation
  For parts of genomes: human, dog, mouse, zebrafish (“Vega genes”)
  For complete genomes: fruitfly (FlyBase), C. elegans (WormBase), yeast (SGD)
• Automatic predictions (“Ensembl genes”)
• EST predictions
• Ab initio predictions (GENSCAN, SNAP)
24 of 42
Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
• UniProt/Swiss-Prot
  A manually curated database and therefore of highest accuracy
• NCBI RefSeq
  A partially manually curated database
• UniProt/TrEMBL
  Automatically annotated translations of EMBL coding sequence (CDS) features
• EMBL / GenBank / DDBJ
  Primary nucleotide sequence repository
25 of 42
The Ensembl Genebuild
Genome assembly + Computer programs + Experimental evidence → Ensembl Genes
26 of 42
Ensembl Identifiers
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• ENSF### Ensembl Family ID
• ENSR### Ensembl Regulatory Feature ID
• For species other than human, a species code is inserted after ENS:
  MUS for mouse (Mus musculus): ENSMUSG###
  DAR for zebrafish (Danio rerio): ENSDARG###
  etc.
• For imported genes Ensembl uses the original identifiers
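The naming scheme above can be sketched as a small parser. This is an illustration only: the regular expression assumes "ENS", an optional three-letter species code as in the MUS and DAR examples, one feature-type letter from the list above, and then digits. The function name `parse_stable_id` and the example identifiers are made up for this sketch, and real identifiers may not all fit this simplified pattern.

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "peptide",
    "E": "exon",
    "F": "family",
    "R": "regulatory feature",
}

# Assumption: "ENS" + optional three-letter species code + type letter + digits.
STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTPEFR])(\d+)$")


class StableId(NamedTuple):
    species_code: Optional[str]  # None means no species code (human)
    feature_type: str
    number: str


def parse_stable_id(identifier: str) -> Optional[StableId]:
    """Split an Ensembl-style identifier; return None if it does not match."""
    match = STABLE_ID.match(identifier)
    if match is None:
        return None
    species_code, type_letter, number = match.groups()
    return StableId(species_code, FEATURE_TYPES[type_letter], number)


# Illustrative identifiers; the digits are placeholders.
for example_id in ("ENSG00000139618", "ENSMUSG00000017167", "ENSDART00000012345"):
    print(example_id, parse_stable_id(example_id))
```

Identifiers imported from other databases keep their original form, so a parser like this would return None for them and they would need separate handling.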
27 of 42
Access to Genome Annotation
• Release web site http://www.ensembl.org/
• Pre-Release http://pre.ensembl.org/
• Archive http://archive.ensembl.org
• BioMart http://www.ensembl.org/Multi/martview
• Downloads ftp://ftp.ensembl.org/
• MySQL interface ensembldb.ensembl.org
• Perl API http://www.ensembl.org/info/software/
28 of 42
Pre! and Archive! Sites
29 of 42
BioMart Data Mining Tool
30 of 42
Downloads
ftp://ftp.ensembl.org/pub
http://www.ensembl.org/info/data/download.html
FASTA files: plain sequence
• DNA (assembly masked and unmasked)
• cDNA (Ensembl and ab initio predictions)
• Peptides (Ensembl and ab initio predictions)
• RNA (non-coding RNA predictions)
Flatfiles: annotated 1Mb slices
• EMBL format
• GenBank format
MySQL: database table dumps
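As a rough illustration of what working with a downloaded FASTA file involves, the sketch below reads FASTA-formatted text into a dictionary. The function name `read_fasta` and the two records are invented for this example; the headers of real Ensembl FASTA files carry more information than is handled here.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's ID (first word after '>') to its full sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split(maxsplit=1)
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines of one record are concatenated.
            records[current_id] += line
    return records


# Made-up records for illustration; not real Ensembl sequences.
example = """>pep1 hypothetical peptide
MKTAYIAK
QRQISFVK
>pep2 another hypothetical peptide
MSLLT
""".splitlines()

sequences = read_fasta(example)
for seq_id, seq in sequences.items():
    print(seq_id, len(seq), seq)
```

For a file on disk, the same function can be given an open file handle instead of a list of lines.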
31 of 42
MySQL
SQL = Structured Query Language
Needed:
• MySQL client program http://www.mysql.com
• Ability to write MySQL queries
• Knowledge of database schema
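To give a feel for why both SQL and schema knowledge are needed, here is a toy sketch. It deliberately uses an in-memory SQLite database rather than the Ensembl MySQL server, and the `gene` table with its columns is a hypothetical, much simplified stand-in for the real schema. The identifiers and coordinates are made up.

```python
import sqlite3

# Toy stand-in for a database server: an in-memory SQLite database.
# The table and its columns are hypothetical, not the real Ensembl schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene (
        stable_id  TEXT PRIMARY KEY,
        biotype    TEXT,
        chromosome TEXT,
        seq_start  INTEGER,
        seq_end    INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [
        ("ENSG99900000001", "protein_coding", "13", 1000, 5000),
        ("ENSG99900000002", "protein_coding", "13", 9000, 9800),
        ("ENSG99900000003", "miRNA", "X", 200, 290),
    ],
)


def genes_in_region(chromosome, start, end):
    """Return stable IDs of genes overlapping chromosome:start-end."""
    rows = conn.execute(
        """
        SELECT stable_id
        FROM gene
        WHERE chromosome = ?
          AND seq_start <= ?
          AND seq_end >= ?
        ORDER BY seq_start
        """,
        (chromosome, end, start),
    )
    return [row[0] for row in rows]


print(genes_in_region("13", 4000, 9500))
```

Writing the WHERE clause requires knowing which table and columns hold the coordinates, which is exactly the schema knowledge the slide lists as a prerequisite.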
32 of 42
Perl API
API = Application Programming Interface
Needed:
• BioPerl modules
• Ensembl modules
• Ability to code in Perl
For more information (installation instructions, tutorials, documentation etc.):
http://www.ensembl.org/info/software/index.html
33 of 42
Ensembl BLAST
WU-BLAST 2.0:
• search against assemblies, Ensembl predictions or ab initio predictions
BLAT and SSAHA2:
• BLAST-like Alignment Tool
• Sequence Search and Alignment by Hashing Algorithm
• very fast
• search against assemblies for (almost) exact DNA-DNA matches
Search against one or multiple species
Search max. 30 sequences simultaneously
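The idea behind hashing-based search for exact matches can be illustrated with a toy k-mer index. This is not how BLAT or SSAHA2 are actually implemented; it is a minimal sketch of the general principle (index short words of the reference once, then look up query words instead of scanning). The function names and sequences are made up.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Hash every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_exact_matches(query: str, reference: str,
                       index: Dict[str, List[int]], k: int) -> List[int]:
    """Return reference positions where the whole query matches exactly."""
    if len(query) < k:
        return []
    hits = []
    # Look up the first k-mer of the query, then verify the rest by comparison.
    for pos in index.get(query[:k], []):
        if reference[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGCACGTTAGG"  # made-up sequence
k = 4
index = build_kmer_index(reference, k)
print(find_exact_matches("ACGTTAG", reference, index, k))
```

Building the index costs one pass over the reference, after which each query only touches the positions sharing its first k-mer. That is the reason such tools are fast but limited to exact or nearly exact matches.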
34 of 42
Ensembl Accounts
• Personalise Ensembl by saving bookmarks, view configurations and homepage preferences in a user account
• Share bookmarks and configurations by setting up groups
Please note that all Ensembl data remain freely accessible. It is not necessary to register in order to gain access to Ensembl data!
35 of 42
Website Statistics
On average 1,000,000 page impressions / week
Top 3 species:
Top 3 countries:
36 of 42
Ensembl – Open Source
• Data and software freely available
• More than 50 installs worldwide
• Academia and industry
• Local or available via the web
• Mirrors with Ensembl data, e.g. http://ensembl.genome.tugraz.at/index.html, or user projects with own data
37 of 42
Powered by Ensembl
38 of 42
What If I Need Help?
• Helpdesk:
helpdesk@ensembl.org
• Workshops on use of the browser or the API
• Mailing lists:
ensembl-dev@ebi.ac.uk
ensembl-announce@ebi.ac.uk
• ‘Geek for a week’ program
• Animated tutorials
http://www.ensembl.org/common/Workshops_Online
39 of 42
Ensembl Team
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Dace Ruklisa, Daniel Zerbino
Vectorbase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Zebrafish Annotation: Kerstin Howe, Tina Eyre, Ian Sealy
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Felix Kokocinski, Jan-Hinnerck Vogel, Simon White
Comparative Genomics: Javier Herrero, Benoit Ballester, Kathryn Beal, Stephen Fitzgerald, Albert Vilella
Web Team: James Smith, Fiona Cunningham, Anne Parker, Bethan Pritchard, Stephen Rice, Steve Trevanion
Outreach & QC: Xosé M Fernández, Bert Overduin, Michael Schuster, Giulietta Spudich
Distributed Annotation System (DAS): Eugene Kulesha, Andy Jenkinson
BioMart: Arek Kasprzyk, Syed Haider, Richard Holland, Damian Smedley
Database Schema and Core API: Glenn Proctor, Andreas Kähäri, Ian Longden, Patrick Meidl
40 of 42
Ensembl Team on the river Cam, 2006
41 of 42
Ewan Birney
42 of 42
Q&A
QUESTIONS
ANSWERS