Upload
truongnhi
View
222
Download
5
Embed Size (px)
Citation preview
EMBL-EBI Tel. +44 (0) 1223 494 444 Wellcome Trust Genome Campus [email protected] Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk
How UniProtKB Maps Genomes And Variants and Provides This Information.
Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE, Alan WILTER SOUSA DA SILVA and Maria JESUS MARTIN EMBL - European Bioinformatics Institute, Welcome Trust Genome Campus,CB10 1SD Hinxton, Cambridge, United Kingdom
Complete Proteomes for Complete Genomes
UniProt is mainly supported by the Na5onal Ins5tutes of Health (NIH) grant 1U41HG006104-‐01. Addi5onal support for the EMBL-‐EBI's involvement in UniProt comes from EMBL and the NIH GO grant 2P41HG02273-‐07. UniProt ac5vi5es at the SIB are addi5onally supported by the Swiss Federal Government through the State Secretariat for Educa5on, Research and Innova5on SERI, and by the EC grants GEN2PHEN (200754) and MICROME (222886-‐2). PIR's UniProt ac5vi5es are also supported by the NIH grants 5R01GM080646-‐07, 3R01GM080646-‐07S1, 5G08LM010720-‐03, and 8P20GM103446-‐12, and the Na5onal Science Founda5on (NSF) grant DBI-‐1062520.
Human Proteome, The exception proves the rule
A new interface will be introduced to seamlessly integrate UniProt taxonomy to the associated proteomes. The expanded view for individual taxa will allow users to choose between alternate complete proteomes available for many species. In addition, Reference proteomes will aid users in making an informed choice.
New space, new concept, new interface
Figure 1. Proteome pipeline. How complete and reference proteomes are made.
Using an example to demonstrate the mapping pipeline in more detail, Figure 3. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequence are mapped to isoform-2. A forth gene product is mapped to an existing TrEMBL entry, A6NED0; whilst a final unknown product is not found in UniProtKB and therefore is added to UniProtKB as a new TrEMBL record.
UniProt has annotated the complete Homo sapiens proteome and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot. Studies on sequence variation are increasing; projects like 1000 Genomes and the Cancer Genome Project are generating a vast amount of variant information that is, or will be, stored by Ensembl in their variation databases. This exponential growth of variation information means a new automated strategy is required for the selection and importing of biomedical sciences and clinical relevant variants from Ensembl variation into UniProtKB.
Figure 2. New proteome interface. The proposed new interface will allow users a choice of proteomes where available, with one or more designated ‘Reference’ proteomes.
The source of the UniProtKB complete proteomes are genomes from INSDC and Ensembl and Ensembl Genomes. For INSDC, all annotated proteins are imported in UniProtKB (UniProtKB/TrEMBL) but only proteins coming from complete, annotated genomes and WGS genomes detected as complete will be tagged with the keyword Complete Proteome. For Ensembl, all predicted protein sequences are mapped to UniProtKB under stringent conditions: 100% identity over 100% of the length of the two sequences. Any sequence found to be absent from UniProtKB is imported. All UniProtKB entries that map to an Ensembl peptide are used to build the proteome; they are tagged and a cross-reference is added.
Mapping the UniProt Human Reference Proteome
to the Reference Genome and Variation Data
Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3
1EMBL-European Bioinformatics Institute, Cambridge, UK
2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA
Email: [email protected]
URL: www.uniprot.org
Funding
UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.
1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml
2) Durbin, R. et al., A map of human genome variation from population scale sequencing., Nature 467(7319):1061 (2010)
3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/
4) Chen Y. et al., Ensembl Variation Resources., BMC Genomics 11(1):293 (2010)
5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk
ReferencesFuture Developments● UniProt intend to extend the variant import pipeline to include other
species with a complete proteome.
Mapping Variants to the UniProt
Human Reference Proteome
Invaluable Information Provided by
Variants
● Now that UniProt has the human reference proteome mapped to the
human reference genome, UniProt has developed a pipeline to import
high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single
amino acid variants from Ensembl variation(4), Figure 2.● 389,935 single amino acid variants have been identified for import into
the UniProt human reference proteome.
Introduction Mapping the Complete Human
Proteome to the Reference Genome● Mapping was made possible through a process of aligning all human
protein sequences in the UniProt Knowledgebase (UniProtKB) to the
protein translations in Ensembl, based on 100% amino acid identity over
the entire sequence.● Using an example to demonstrate the mapping pipeline in more detail,
Figure 1. The transcription and translation products of the human gene
GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by
amino acid identity. The first transcript is mapped to isoform-1. Two
separate gene transcripts with identical amino acid sequence are mapped
to isoform-2. A forth gene product is mapped to an existing TrEMBL entry,
A6NED0; whilst a final unknown product is not found in UniProtKB and
therefore is added to UniProtKB as a new TrEMBL record.
Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's
BRCA1_HUMAN Entry P38398.
● UniProt has annotated the complete Homo sapiens proteome and
approximately 20,000 protein coding genes are represented by a
canonical protein sequence in UniProtKB/Swiss-Prot.● Most of these protein sequences are now mapped to the reference
genome assembly produced by the international Genome Reference
Consortium (GRC)(1).● As described in a separate poster, UniProt manually annotate variant
sequences with known functional consequences from the literature.● Studies on sequence variation are increasing; projects like 1000
Genomes(2) and the Cancer Genome Project(3) are generating a vast
amount of variant information that is, or will be, stored by Ensembl
variation(4) in their databases.● This exponential growth of variation information means a new automated
strategy is required for the selection and importing of biomedical sciences
and clinical relevant variants from Ensembl variation into UniProtKB.
● The imported variant information provides amino acid mutations,
observed within the sampled population, for specific protein sequence
isoforms, Figure 2.● Biochemical and biomedical consequences of germline and somatic
variants are defined through well characterised phenotypes observed in
sample populations, Figure 3.● Prediction of the potential consequence of a variant on an individual
when a phenotype for an observed variant in a population has not been
isolated.
Figure 3:The Good and the Bad biochemical and biomedical consequences of missense variants.
Figure 2: Examples of the new information provided by the new variant import pipeline.
Figure 3. Mapping of the Ensembl Human GSTZ1 gene to the protein sequences of UniProtKB's MAAI_HUMAN Entry O43708
Figure 4. Examples of the new informa5on provided by the new variant import pipeline.
Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation, Figure 4. 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.
Mapping the Complete Human Proteome to the Reference Genome
Mapping Variants to the UniProt Human Reference Proteome
Mapping the UniProt Human Reference Proteome
to the Reference Genome and Variation Data
Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3
1EMBL-European Bioinformatics Institute, Cambridge, UK
2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA
Email: [email protected]
URL: www.uniprot.org
Funding
UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.
1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml
2) Durbin, R. et al., A map of human genome variation from population scale sequencing., Nature 467(7319):1061 (2010)
3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/
4) Chen Y. et al., Ensembl Variation Resources., BMC Genomics 11(1):293 (2010)
5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk
ReferencesFuture Developments● UniProt intend to extend the variant import pipeline to include other
species with a complete proteome.
Mapping Variants to the UniProt
Human Reference Proteome
Invaluable Information Provided by
Variants
● Now that UniProt has the human reference proteome mapped to the
human reference genome, UniProt has developed a pipeline to import
high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single
amino acid variants from Ensembl variation(4), Figure 2.● 389,935 single amino acid variants have been identified for import into
the UniProt human reference proteome.
Introduction Mapping the Complete Human
Proteome to the Reference Genome● Mapping was made possible through a process of aligning all human
protein sequences in the UniProt Knowledgebase (UniProtKB) to the
protein translations in Ensembl, based on 100% amino acid identity over
the entire sequence.● Using an example to demonstrate the mapping pipeline in more detail,
Figure 1. The transcription and translation products of the human gene
GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by
amino acid identity. The first transcript is mapped to isoform-1. Two
separate gene transcripts with identical amino acid sequence are mapped
to isoform-2. A forth gene product is mapped to an existing TrEMBL entry,
A6NED0; whilst a final unknown product is not found in UniProtKB and
therefore is added to UniProtKB as a new TrEMBL record.
Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's
BRCA1_HUMAN Entry P38398.
● UniProt has annotated the complete Homo sapiens proteome and
approximately 20,000 protein coding genes are represented by a
canonical protein sequence in UniProtKB/Swiss-Prot.● Most of these protein sequences are now mapped to the reference
genome assembly produced by the international Genome Reference
Consortium (GRC)(1).● As described in a separate poster, UniProt manually annotate variant
sequences with known functional consequences from the literature.● Studies on sequence variation are increasing; projects like 1000
Genomes(2) and the Cancer Genome Project(3) are generating a vast
amount of variant information that is, or will be, stored by Ensembl
variation(4) in their databases.● This exponential growth of variation information means a new automated
strategy is required for the selection and importing of biomedical sciences
and clinical relevant variants from Ensembl variation into UniProtKB.
● The imported variant information provides amino acid mutations,
observed within the sampled population, for specific protein sequence
isoforms, Figure 2.● Biochemical and biomedical consequences of germline and somatic
variants are defined through well characterised phenotypes observed in
sample populations, Figure 3.● Prediction of the potential consequence of a variant on an individual
when a phenotype for an observed variant in a population has not been
isolated.
Figure 3:The Good and the Bad biochemical and biomedical consequences of missense variants.
Figure 2: Examples of the new information provided by the new variant import pipeline.
UniProt has defined a set of Reference Proteomes which are ‘landmarks’ in proteome space. Reference proteomes are selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms [including those in the now defunct IPI sets] and other proteomes of interest for biomedical and biotechnological research. These are the proteomes which are preferentially selected for manual curation when resources permit. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.
Reference Proteomes coming out of the line
Since 2009 UniProt team is working in collaboration with QfO consortium in order to provide a gene centric benchmark data set for reference proteomes. This dataset is released once a year and concern 147 species for 2013 base on UniprotKB release 2013_04. This benchmark provide to the ortholog community a standard data set to effectively compare their methods. For each of the proteomes we provide : one fasta file containing non-redundant FASTA sets for the canonical sequences, gene to protein mapping file and idmapping containing all cross-references link to those proteins. See http://www.ebi.ac.uk/reference_proteomes/.
Quest for Orthologs (QfO), a gene centric view
UniProt Consor5um. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40:D71-‐5, 2012. T. Gabaldon, C. Dessimoz, J. Huxley-‐Jones, A. Vilella, E. Sonnhammer and S. Lewis. Joining forces in the quest for orthologs. Genome Biology, 9:403, 2009. 1000 Genomes Project Consor5um. A map of human genome varia=on from popula=on-‐scale sequencing. Nature 467 (7319): 1061–1073, 2010. SA. Forbes, G. Bhamra, S. Bamford, E. Dawson, C. Kok, et al. The Catalogue of Soma=c Muta=ons in Cancer (COSMIC). Curr Protoc Hum Genet, Chapter 10: Unit 10-‐11, 2008.