1
EMBL-EBI Tel. +44 (0) 1223 494 444 Wellcome Trust Genome Campus [email protected] Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk How UniProtKB Maps Genomes And Variants and Provides This Information. Hermann ZELLNER , Benoît BELY , Andrew NIGHTINGALE, Alan WILTER SOUSA DA SILVA and Maria JESUS MARTIN EMBL - European Bioinformatics Institute, Welcome Trust Genome Campus,CB10 1SD Hinxton, Cambridge, United Kingdom Complete Proteomes for Complete Genomes UniProt is mainly supported by the Na5onal Ins5tutes of Health (NIH) grant 1U41HG00610401. Addi5onal support for the EMBLEBI's involvement in UniProt comes from EMBL and the NIH GO grant 2P41HG0227307. UniProt ac5vi5es at the SIB are addi5onally supported by the Swiss Federal Government through the State Secretariat for Educa5on, Research and Innova5on SERI, and by the EC grants GEN2PHEN (200754) and MICROME (2228862). PIR's UniProt ac5vi5es are also supported by the NIH grants 5R01GM08064607, 3R01GM08064607S1, 5G08LM01072003, and 8P20GM10344612, and the Na5onal Science Founda5on (NSF) grant DBI1062520. Human Proteome, The exception proves the rule A new interface will be introduced to seamlessly integrate UniProt taxonomy to the associated proteomes. The expanded view for individual taxa will allow users to choose between alternate complete proteomes available for many species. In addition, Reference proteomes will aid users in making an informed choice. New space, new concept, new interface Figure 1. Proteome pipeline. How complete and reference proteomes are made. Using an example to demonstrate the mapping pipeline in more detail, Figure 3. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequence are mapped to isoform-2. A forth gene product is mapped to an existing TrEMBL entry, A6NED0; whilst a final unknown product is not found in UniProtKB and therefore is added to UniProtKB as a new TrEMBL record. UniProt has annotated the complete Homo sapiens proteome and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss- Prot. Studies on sequence variation are increasing; projects like 1000 Genomes and the Cancer Genome Project are generating a vast amount of variant information that is, or will be, stored by Ensembl in their variation databases. This exponential growth of variation information means a new automated strategy is required for the selection and importing of biomedical sciences and clinical relevant variants from Ensembl variation into UniProtKB. Figure 2. New proteome interface. The proposed new interface will allow users a choice of proteomes where available, with one or more designated ‘Reference’ proteomes. The source of the UniProtKB complete proteomes are genomes from INSDC and Ensembl and Ensembl Genomes. For INSDC, all annotated proteins are imported in UniProtKB (UniProtKB/TrEMBL) but only proteins coming from complete, annotated genomes and WGS genomes detected as complete will be tagged with the keyword Complete Proteome. For Ensembl, all predicted protein sequences are mapped to UniProtKB under stringent conditions: 100% identity over 100% of the length of the two sequences. Any sequence found to be absent from UniProtKB is imported. All UniProtKB entries that map to an Ensembl peptide are used to build the proteome; they are tagged and a cross-reference is added. Figure 3. Mapping of the Ensembl Human GSTZ1 gene to the protein sequences of UniProtKB's MAAI_HUMAN Entry O43708 Figure 4. Examples of the new informa5on provided by the new variant import pipeline. Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation, Figure 4. 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome. Mapping the Complete Human Proteome to the Reference Genome Mapping Variants to the UniProt Human Reference Proteome UniProt has defined a set of Reference Proteomes which are ‘landmarks’ in proteome space. Reference proteomes are selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms [including those in the now defunct IPI sets] and other proteomes of interest for biomedical and biotechnological research. These are the proteomes which are preferentially selected for manual curation when resources permit. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest. Reference Proteomes coming out of the line Since 2009 UniProt team is working in collaboration with QfO consortium in order to provide a gene centric benchmark data set for reference proteomes. This dataset is released once a year and concern 147 species for 2013 base on UniprotKB release 2013_04. This benchmark provide to the ortholog community a standard data set to effectively compare their methods. For each of the proteomes we provide : one fasta file containing non-redundant FASTA sets for the canonical sequences, gene to protein mapping file and idmapping containing all cross-references link to those proteins. See http://www.ebi.ac.uk/reference_proteomes/. Quest for Orthologs (QfO), a gene centric view UniProt Consor5um. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40:D715, 2012. T. Gabaldon, C. Dessimoz, J. HuxleyJones, A. Vilella, E. Sonnhammer and S. Lewis. Joining forces in the quest for orthologs. Genome Biology, 9:403, 2009. 1000 Genomes Project Consor5um. A map of human genome varia=on from popula=onscale sequencing. Nature 467 (7319): 1061–1073, 2010. SA. Forbes, G. Bhamra, S. Bamford, E. Dawson, C. Kok, et al. The Catalogue of Soma=c Muta=ons in Cancer (COSMIC). Curr Protoc Hum Genet, Chapter 10: Unit 1011, 2008.

How UniProtKB Maps Genomes And Variants and Provides … · How UniProtKB Maps Genomes And Variants and Provides This Information. Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE,

Embed Size (px)

Citation preview

Page 1: How UniProtKB Maps Genomes And Variants and Provides … · How UniProtKB Maps Genomes And Variants and Provides This Information. Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE,

EMBL-EBI Tel. +44 (0) 1223 494 444 Wellcome Trust Genome Campus [email protected] Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk

How UniProtKB Maps Genomes And Variants and Provides This Information.

Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE, Alan WILTER SOUSA DA SILVA and Maria JESUS MARTIN EMBL - European Bioinformatics Institute, Welcome Trust Genome Campus,CB10 1SD Hinxton, Cambridge, United Kingdom

Complete Proteomes for Complete Genomes

UniProt   is   mainly   supported   by   the   Na5onal   Ins5tutes   of   Health   (NIH)   grant   1U41HG006104-­‐01.   Addi5onal   support   for   the   EMBL-­‐EBI's   involvement   in  UniProt  comes  from  EMBL  and  the  NIH  GO  grant  2P41HG02273-­‐07.  UniProt  ac5vi5es  at  the  SIB  are  addi5onally  supported  by  the  Swiss  Federal  Government  through   the   State   Secretariat   for   Educa5on,   Research   and   Innova5on   SERI,   and   by   the   EC   grants   GEN2PHEN   (200754)   and  MICROME   (222886-­‐2).   PIR's  UniProt   ac5vi5es   are   also   supported   by   the   NIH   grants   5R01GM080646-­‐07,   3R01GM080646-­‐07S1,   5G08LM010720-­‐03,   and   8P20GM103446-­‐12,   and   the  Na5onal  Science  Founda5on  (NSF)  grant  DBI-­‐1062520.  

Human Proteome, The exception proves the rule

A new interface will be introduced to seamlessly integrate UniProt taxonomy to the associated proteomes. The expanded view for individual taxa will allow users to choose between alternate complete proteomes available for many species. In addition, Reference proteomes will aid users in making an informed choice.

New space, new concept, new interface

Figure  1.  Proteome  pipeline.    How  complete  and  reference  proteomes  are  made.  

Using an example to demonstrate the mapping pipeline in more detail, Figure 3. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequence are mapped to isoform-2. A forth gene product is mapped to an existing TrEMBL entry, A6NED0; whilst a final unknown product is not found in UniProtKB and therefore is added to UniProtKB as a new TrEMBL record.

UniProt has annotated the complete Homo sapiens proteome and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot. Studies on sequence variation are increasing; projects like 1000 Genomes and the Cancer Genome Project are generating a vast amount of variant information that is, or will be, stored by Ensembl in their variation databases. This exponential growth of variation information means a new automated strategy is required for the selection and importing of biomedical sciences and clinical relevant variants from Ensembl variation into UniProtKB.

Figure  2.  New  proteome  interface.  The  proposed  new  interface  will  allow  users  a  choice  of  proteomes  where  available,  with  one  or  more  designated  ‘Reference’  proteomes.  

The source of the UniProtKB complete proteomes are genomes from INSDC and Ensembl and Ensembl Genomes. For INSDC, all annotated proteins are imported in UniProtKB (UniProtKB/TrEMBL) but only proteins coming from complete, annotated genomes and WGS genomes detected as complete will be tagged with the keyword Complete Proteome. For Ensembl, all predicted protein sequences are mapped to UniProtKB under stringent conditions: 100% identity over 100% of the length of the two sequences. Any sequence found to be absent from UniProtKB is imported. All UniProtKB entries that map to an Ensembl peptide are used to build the proteome; they are tagged and a cross-reference is added.

Mapping the UniProt Human Reference Proteome

to the Reference Genome and Variation Data

Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3

1EMBL-European Bioinformatics Institute, Cambridge, UK

2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland

3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA

Email: [email protected]

URL: www.uniprot.org

Funding

UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.

1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml

2) Durbin, R. et al., A map of human genome variation from population scale sequencing., Nature 467(7319):1061 (2010)

3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/

4) Chen Y. et al., Ensembl Variation Resources., BMC Genomics 11(1):293 (2010)

5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk

ReferencesFuture Developments● UniProt intend to extend the variant import pipeline to include other

species with a complete proteome.

Mapping Variants to the UniProt

Human Reference Proteome

Invaluable Information Provided by

Variants

● Now that UniProt has the human reference proteome mapped to the

human reference genome, UniProt has developed a pipeline to import

high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single

amino acid variants from Ensembl variation(4), Figure 2.● 389,935 single amino acid variants have been identified for import into

the UniProt human reference proteome.

Introduction Mapping the Complete Human

Proteome to the Reference Genome● Mapping was made possible through a process of aligning all human

protein sequences in the UniProt Knowledgebase (UniProtKB) to the

protein translations in Ensembl, based on 100% amino acid identity over

the entire sequence.● Using an example to demonstrate the mapping pipeline in more detail,

Figure 1. The transcription and translation products of the human gene

GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by

amino acid identity. The first transcript is mapped to isoform-1. Two

separate gene transcripts with identical amino acid sequence are mapped

to isoform-2. A forth gene product is mapped to an existing TrEMBL entry,

A6NED0; whilst a final unknown product is not found in UniProtKB and

therefore is added to UniProtKB as a new TrEMBL record.

Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's

BRCA1_HUMAN Entry P38398.

● UniProt has annotated the complete Homo sapiens proteome and

approximately 20,000 protein coding genes are represented by a

canonical protein sequence in UniProtKB/Swiss-Prot.● Most of these protein sequences are now mapped to the reference

genome assembly produced by the international Genome Reference

Consortium (GRC)(1).● As described in a separate poster, UniProt manually annotate variant

sequences with known functional consequences from the literature.● Studies on sequence variation are increasing; projects like 1000

Genomes(2) and the Cancer Genome Project(3) are generating a vast

amount of variant information that is, or will be, stored by Ensembl

variation(4) in their databases.● This exponential growth of variation information means a new automated

strategy is required for the selection and importing of biomedical sciences

and clinical relevant variants from Ensembl variation into UniProtKB.

● The imported variant information provides amino acid mutations,

observed within the sampled population, for specific protein sequence

isoforms, Figure 2.● Biochemical and biomedical consequences of germline and somatic

variants are defined through well characterised phenotypes observed in

sample populations, Figure 3.● Prediction of the potential consequence of a variant on an individual

when a phenotype for an observed variant in a population has not been

isolated.

Figure 3:The Good and the Bad biochemical and biomedical consequences of missense variants.

Figure 2: Examples of the new information provided by the new variant import pipeline.

Figure   3.  Mapping   of   the   Ensembl  Human  GSTZ1   gene   to   the   protein   sequences   of  UniProtKB's  MAAI_HUMAN  Entry  O43708  

Figure  4.  Examples  of  the  new  informa5on  provided  by  the  new  variant  import  pipeline.  

Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation, Figure 4. 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.

Mapping the Complete Human Proteome to the Reference Genome

Mapping Variants to the UniProt Human Reference Proteome

Mapping the UniProt Human Reference Proteome

to the Reference Genome and Variation Data

Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3

1EMBL-European Bioinformatics Institute, Cambridge, UK

2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland

3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA

Email: [email protected]

URL: www.uniprot.org

Funding

UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.

1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml

2) Durbin, R. et al., A map of human genome variation from population scale sequencing., Nature 467(7319):1061 (2010)

3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/

4) Chen Y. et al., Ensembl Variation Resources., BMC Genomics 11(1):293 (2010)

5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk

ReferencesFuture Developments● UniProt intend to extend the variant import pipeline to include other

species with a complete proteome.

Mapping Variants to the UniProt

Human Reference Proteome

Invaluable Information Provided by

Variants

● Now that UniProt has the human reference proteome mapped to the

human reference genome, UniProt has developed a pipeline to import

high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single

amino acid variants from Ensembl variation(4), Figure 2.● 389,935 single amino acid variants have been identified for import into

the UniProt human reference proteome.

Introduction Mapping the Complete Human

Proteome to the Reference Genome● Mapping was made possible through a process of aligning all human

protein sequences in the UniProt Knowledgebase (UniProtKB) to the

protein translations in Ensembl, based on 100% amino acid identity over

the entire sequence.● Using an example to demonstrate the mapping pipeline in more detail,

Figure 1. The transcription and translation products of the human gene

GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by

amino acid identity. The first transcript is mapped to isoform-1. Two

separate gene transcripts with identical amino acid sequence are mapped

to isoform-2. A forth gene product is mapped to an existing TrEMBL entry,

A6NED0; whilst a final unknown product is not found in UniProtKB and

therefore is added to UniProtKB as a new TrEMBL record.

Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's

BRCA1_HUMAN Entry P38398.

● UniProt has annotated the complete Homo sapiens proteome and

approximately 20,000 protein coding genes are represented by a

canonical protein sequence in UniProtKB/Swiss-Prot.● Most of these protein sequences are now mapped to the reference

genome assembly produced by the international Genome Reference

Consortium (GRC)(1).● As described in a separate poster, UniProt manually annotate variant

sequences with known functional consequences from the literature.● Studies on sequence variation are increasing; projects like 1000

Genomes(2) and the Cancer Genome Project(3) are generating a vast

amount of variant information that is, or will be, stored by Ensembl

variation(4) in their databases.● This exponential growth of variation information means a new automated

strategy is required for the selection and importing of biomedical sciences

and clinical relevant variants from Ensembl variation into UniProtKB.

● The imported variant information provides amino acid mutations,

observed within the sampled population, for specific protein sequence

isoforms, Figure 2.● Biochemical and biomedical consequences of germline and somatic

variants are defined through well characterised phenotypes observed in

sample populations, Figure 3.● Prediction of the potential consequence of a variant on an individual

when a phenotype for an observed variant in a population has not been

isolated.

Figure 3:The Good and the Bad biochemical and biomedical consequences of missense variants.

Figure 2: Examples of the new information provided by the new variant import pipeline.

UniProt has defined a set of Reference Proteomes which are ‘landmarks’ in proteome space. Reference proteomes are selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms [including those in the now defunct IPI sets] and other proteomes of interest for biomedical and biotechnological research. These are the proteomes which are preferentially selected for manual curation when resources permit. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.

Reference Proteomes coming out of the line

Since 2009 UniProt team is working in collaboration with QfO consortium in order to provide a gene centric benchmark data set for reference proteomes. This dataset is released once a year and concern 147 species for 2013 base on UniprotKB release 2013_04. This benchmark provide to the ortholog community a standard data set to effectively compare their methods. For each of the proteomes we provide : one fasta file containing non-redundant FASTA sets for the canonical sequences, gene to protein mapping file and idmapping containing all cross-references link to those proteins. See http://www.ebi.ac.uk/reference_proteomes/.

Quest for Orthologs (QfO), a gene centric view

UniProt  Consor5um.  Reorganizing  the  protein  space  at  the  Universal  Protein  Resource  (UniProt).  Nucleic  Acids  Res.,  40:D71-­‐5,  2012.  T.  Gabaldon,  C.  Dessimoz,  J.  Huxley-­‐Jones,  A.  Vilella,  E.  Sonnhammer  and  S.  Lewis.  Joining  forces  in  the  quest  for  orthologs.  Genome  Biology,  9:403,  2009.  1000  Genomes  Project  Consor5um.  A  map  of  human  genome  varia=on  from  popula=on-­‐scale  sequencing.  Nature  467  (7319):  1061–1073,  2010.  SA.  Forbes,  G.  Bhamra,  S.  Bamford,  E.  Dawson,  C.  Kok,  et  al.  The  Catalogue  of  Soma=c  Muta=ons  in  Cancer  (COSMIC).  Curr  Protoc  Hum  Genet,  Chapter  10:  Unit  10-­‐11,  2008.