How UniProtKB Maps Genomes And Variants and Provides … · How UniProtKB Maps Genomes And Variants and Provides This Information. Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE,

EMBL-EBI Tel. +44 (0) 1223 494 444 Wellcome Trust Genome Campus [email protected] Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk

How UniProtKB Maps Genomes And Variants and Provides This Information.

Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE, Alan WILTER SOUSA DA SILVA and Maria JESUS MARTIN EMBL - European Bioinformatics Institute, Welcome Trust Genome Campus,CB10 1SD Hinxton, Cambridge, United Kingdom

Complete Proteomes for Complete Genomes

UniProt is mainly supported by the Na5onal Ins5tutes of Health (NIH) grant 1U41HG006104-‐01. Addi5onal support for the EMBL-‐EBI's involvement in UniProt comes from EMBL and the NIH GO grant 2P41HG02273-‐07. UniProt ac5vi5es at the SIB are addi5onally supported by the Swiss Federal Government through the State Secretariat for Educa5on, Research and Innova5on SERI, and by the EC grants GEN2PHEN (200754) and MICROME (222886-‐2). PIR's UniProt ac5vi5es are also supported by the NIH grants 5R01GM080646-‐07, 3R01GM080646-‐07S1, 5G08LM010720-‐03, and 8P20GM103446-‐12, and the Na5onal Science Founda5on (NSF) grant DBI-‐1062520.

Human Proteome, The exception proves the rule

A new interface will be introduced to seamlessly integrate UniProt taxonomy to the associated proteomes. The expanded view for individual taxa will allow users to choose between alternate complete proteomes available for many species. In addition, Reference proteomes will aid users in making an informed choice.

New space, new concept, new interface

Figure 1. Proteome pipeline. How complete and reference proteomes are made.

Using an example to demonstrate the mapping pipeline in more detail, Figure 3. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequence are mapped to isoform-2. A forth gene product is mapped to an existing TrEMBL entry, A6NED0; whilst a final unknown product is not found in UniProtKB and therefore is added to UniProtKB as a new TrEMBL record.

UniProt has annotated the complete Homo sapiens proteome and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot. Studies on sequence variation are increasing; projects like 1000 Genomes and the Cancer Genome Project are generating a vast amount of variant information that is, or will be, stored by Ensembl in their variation databases. This exponential growth of variation information means a new automated strategy is required for the selection and importing of biomedical sciences and clinical relevant variants from Ensembl variation into UniProtKB.

Figure 2. New proteome interface. The proposed new interface will allow users a choice of proteomes where available, with one or more designated ‘Reference’ proteomes.

The source of the UniProtKB complete proteomes are genomes from INSDC and Ensembl and Ensembl Genomes. For INSDC, all annotated proteins are imported in UniProtKB (UniProtKB/TrEMBL) but only proteins coming from complete, annotated genomes and WGS genomes detected as complete will be tagged with the keyword Complete Proteome. For Ensembl, all predicted protein sequences are mapped to UniProtKB under stringent conditions: 100% identity over 100% of the length of the two sequences. Any sequence found to be absent from UniProtKB is imported. All UniProtKB entries that map to an Ensembl peptide are used to build the proteome; they are tagged and a cross-reference is added.

Mapping the UniProt Human Reference Proteome

to the Reference Genome and Variation Data

Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3

1EMBL-European Bioinformatics Institute, Cambridge, UK

2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland

3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA

Email: [email protected]

URL: www.uniprot.org

Funding

UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.

1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml

2) Durbin, R. et al., A map of human genome variation from population scale sequencing., Nature 467(7319):1061 (2010)

3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/

4) Chen Y. et al., Ensembl Variation Resources., BMC Genomics 11(1):293 (2010)

5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk

ReferencesFuture Developments● UniProt intend to extend the variant import pipeline to include other

species with a complete proteome.

Mapping Variants to the UniProt

Human Reference Proteome

Invaluable Information Provided by

Variants

● Now that UniProt has the human reference proteome mapped to the

human reference genome, UniProt has developed a pipeline to import

high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single

amino acid variants from Ensembl variation(4), Figure 2.● 389,935 single amino acid variants have been identified for import into

the UniProt human reference proteome.

Introduction Mapping the Complete Human

Proteome to the Reference Genome● Mapping was made possible through a process of aligning all human

protein sequences in the UniProt Knowledgebase (UniProtKB) to the

protein translations in Ensembl, based on 100% amino acid identity over

the entire sequence.● Using an example to demonstrate the mapping pipeline in more detail,

Figure 1. The transcription and translation products of the human gene

GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by

amino acid identity. The first transcript is mapped to isoform-1. Two

separate gene transcripts with identical amino acid sequence are mapped

to isoform-2. A forth gene product is mapped to an existing TrEMBL entry,

A6NED0; whilst a final unknown product is not found in UniProtKB and

therefore is added to UniProtKB as a new TrEMBL record.

Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's

BRCA1_HUMAN Entry P38398.

● UniProt has annotated the complete Homo sapiens proteome and

approximately 20,000 protein coding genes are represented by a

canonical protein sequence in UniProtKB/Swiss-Prot.● Most of these protein sequences are now mapped to the reference

genome assembly produced by the international Genome Reference

Consortium (GRC)(1).● As described in a separate poster, UniProt manually annotate variant

sequences with known functional consequences from the literature.● Studies on sequence variation are increasing; projects like 1000

Genomes(2) and the Cancer Genome Project(3) are generating a vast

amount of variant information that is, or will be, stored by Ensembl

variation(4) in their databases.● This exponential growth of variation information means a new automated

strategy is required for the selection and importing of biomedical sciences

and clinical relevant variants from Ensembl variation into UniProtKB.

● The imported variant information provides amino acid mutations,

observed within the sampled population, for specific protein sequence

isoforms, Figure 2.● Biochemical and biomedical consequences of germline and somatic

variants are defined through well characterised phenotypes observed in

sample populations, Figure 3.● Prediction of the potential consequence of a variant on an individual

when a phenotype for an observed variant in a population has not been

isolated.

Figure 3:The Good and the Bad biochemical and biomedical consequences of missense variants.

Figure 2: Examples of the new information provided by the new variant import pipeline.

Figure 3. Mapping of the Ensembl Human GSTZ1 gene to the protein sequences of UniProtKB's MAAI_HUMAN Entry O43708

Figure 4. Examples of the new informa5on provided by the new variant import pipeline.

Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation, Figure 4. 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.

Mapping the Complete Human Proteome to the Reference Genome

Mapping Variants to the UniProt Human Reference Proteome

Mapping the UniProt Human Reference Proteome

to the Reference Genome and Variation Data

Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3

1EMBL-European Bioinformatics Institute, Cambridge, UK

2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland

3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA

Email: [email protected]

URL: www.uniprot.org

Funding

UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.

1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml

2) Durbin, R. et al., A map of human genome variation from population scale sequencing., Nature 467(7319):1061 (2010)

3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/

4) Chen Y. et al., Ensembl Variation Resources., BMC Genomics 11(1):293 (2010)

5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk

ReferencesFuture Developments● UniProt intend to extend the variant import pipeline to include other

species with a complete proteome.

Mapping Variants to the UniProt

Human Reference Proteome

Invaluable Information Provided by

Variants

● Now that UniProt has the human reference proteome mapped to the

human reference genome, UniProt has developed a pipeline to import

high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single

amino acid variants from Ensembl variation(4), Figure 2.● 389,935 single amino acid variants have been identified for import into

the UniProt human reference proteome.

Introduction Mapping the Complete Human

Proteome to the Reference Genome● Mapping was made possible through a process of aligning all human

protein sequences in the UniProt Knowledgebase (UniProtKB) to the

protein translations in Ensembl, based on 100% amino acid identity over

the entire sequence.● Using an example to demonstrate the mapping pipeline in more detail,

Figure 1. The transcription and translation products of the human gene

GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by

amino acid identity. The first transcript is mapped to isoform-1. Two

separate gene transcripts with identical amino acid sequence are mapped

to isoform-2. A forth gene product is mapped to an existing TrEMBL entry,

A6NED0; whilst a final unknown product is not found in UniProtKB and

therefore is added to UniProtKB as a new TrEMBL record.

Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's

BRCA1_HUMAN Entry P38398.

● UniProt has annotated the complete Homo sapiens proteome and

approximately 20,000 protein coding genes are represented by a

canonical protein sequence in UniProtKB/Swiss-Prot.● Most of these protein sequences are now mapped to the reference

genome assembly produced by the international Genome Reference

Consortium (GRC)(1).● As described in a separate poster, UniProt manually annotate variant

sequences with known functional consequences from the literature.● Studies on sequence variation are increasing; projects like 1000

Genomes(2) and the Cancer Genome Project(3) are generating a vast

amount of variant information that is, or will be, stored by Ensembl

variation(4) in their databases.● This exponential growth of variation information means a new automated

strategy is required for the selection and importing of biomedical sciences

and clinical relevant variants from Ensembl variation into UniProtKB.

● The imported variant information provides amino acid mutations,

observed within the sampled population, for specific protein sequence

isoforms, Figure 2.● Biochemical and biomedical consequences of germline and somatic

variants are defined through well characterised phenotypes observed in

sample populations, Figure 3.● Prediction of the potential consequence of a variant on an individual

when a phenotype for an observed variant in a population has not been

isolated.

Figure 3:The Good and the Bad biochemical and biomedical consequences of missense variants.

Figure 2: Examples of the new information provided by the new variant import pipeline.

UniProt has defined a set of Reference Proteomes which are ‘landmarks’ in proteome space. Reference proteomes are selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms [including those in the now defunct IPI sets] and other proteomes of interest for biomedical and biotechnological research. These are the proteomes which are preferentially selected for manual curation when resources permit. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.

Reference Proteomes coming out of the line

Since 2009 UniProt team is working in collaboration with QfO consortium in order to provide a gene centric benchmark data set for reference proteomes. This dataset is released once a year and concern 147 species for 2013 base on UniprotKB release 2013_04. This benchmark provide to the ortholog community a standard data set to effectively compare their methods. For each of the proteomes we provide : one fasta file containing non-redundant FASTA sets for the canonical sequences, gene to protein mapping file and idmapping containing all cross-references link to those proteins. See http://www.ebi.ac.uk/reference_proteomes/.

Quest for Orthologs (QfO), a gene centric view

UniProt Consor5um. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40:D71-‐5, 2012. T. Gabaldon, C. Dessimoz, J. Huxley-‐Jones, A. Vilella, E. Sonnhammer and S. Lewis. Joining forces in the quest for orthologs. Genome Biology, 9:403, 2009. 1000 Genomes Project Consor5um. A map of human genome varia=on from popula=on-‐scale sequencing. Nature 467 (7319): 1061–1073, 2010. SA. Forbes, G. Bhamra, S. Bamford, E. Dawson, C. Kok, et al. The Catalogue of Soma=c Muta=ons in Cancer (COSMIC). Curr Protoc Hum Genet, Chapter 10: Unit 10-‐11, 2008.

Documents

How UniProtKB Maps Genomes And Variants and Provides … · How UniProtKB Maps Genomes And Variants and Provides This Information. Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE,