Exploring the potential of public proteomics data Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-EBI Hinxton, Cambridge, UK

Reuse of public data in proteomics

EMBL-EBI Now and in the Future

Juan A. [email protected]
Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016

Datasets are being reused more and more.

Vaudel et al., Proteomics, 2016

Data download volume for PRIDE Archive in 2015: 198 TB

Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Data sharing in Proteomics

Data as they are.

Protein knowledge bases: UniProt, neXtProt.

Contributing to the Protein Evidence Code.

Protein Evidence codes in UniProt/neXtProt

http://www.uniprot.org/help/protein_existence

Use of MS data in UniProt

Use of MS data in neXtProt

Reuse

Information is not only extracted, but reused in new experiments with the potential of generating new knowledge.

Transitions used in SRM approaches.

Meta-analysis approaches.

Spectral libraries.

SRMAtlas

http://www.srmatlas.org/

PeptidePicker

http://mrmpeptidepicker.proteincentre.com/

Meta-analysis approaches

Putting data coming from a lot of experiments together, to extract new knowledge. Examples:

Study the cleavage mechanism and performance of trypsin.

Fragmentation patterns.

Retention time prediction.

Which is the most suitable reference DB for long-term proteomics data storage?

Data integration of experiments done at different time points.
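As a small illustration of the trypsin example, the sketch below counts missed cleavages in a pooled list of identified peptides. It is only a sketch under stated assumptions: the cleavage rule (cut after K or R unless followed by P) is a common simplification, and the function names and peptide sequences are hypothetical, not taken from any of the cited studies.

```python
import re
from collections import Counter

# Simplified trypsin rule (an assumption for illustration): a K or R that is
# followed by any residue other than P is a potential cleavage site. The
# lookahead also excludes the C-terminal residue, which has no successor.
_INTERNAL_SITE = re.compile(r"[KR](?=[^P])")


def missed_cleavages(peptide: str) -> int:
    """Count internal tryptic sites that were left uncleaved in a peptide."""
    return len(_INTERNAL_SITE.findall(peptide.upper()))


def missed_cleavage_distribution(peptides) -> Counter:
    """Map number of missed cleavages -> number of peptides, across datasets."""
    return Counter(missed_cleavages(p) for p in peptides)


if __name__ == "__main__":
    # Hypothetical peptides pooled from several experiments.
    pooled = ["SAMPLEK", "AKSAMPLER", "AKPSAMPLER", "AKRGGK"]
    print(missed_cleavage_distribution(pooled))
```

Run over the identifications of many public datasets, a distribution like this gives a rough view of how often trypsin skips a site.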

Spectral searching

Concept: To compare experimental spectra to other experimental spectra.

There are many spectral libraries publicly available (for instance, from NIST, PeptideAtlas and PRIDE)

Custom search engines have been developed:

SpectraST (TPP)

X!Hunter (GPM)

BiblioSpec

It has been claimed that spectral searches are more sensitive than sequence database approaches.
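The concept of comparing experimental spectra with library spectra can be sketched as follows. This is a minimal illustration using a cosine similarity on binned peaks. It is not the scoring function of SpectraST, X!Hunter or BiblioSpec, and the bin width, function names and example spectra are all assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin index -> intensity)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0.0 if either is empty)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_library_match(query_peaks, library, bin_width=1.0):
    """Return (library entry name, score) for the most similar library spectrum.

    `library` maps an entry name to a list of (m/z, intensity) peaks.
    Returns None for an empty library.
    """
    if not library:
        return None
    scored = (
        (name, cosine_similarity(query_peaks, peaks, bin_width))
        for name, peaks in library.items()
    )
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical library entries and query spectrum.
    library = {
        "PEPTIDEK": [(100.1, 10.0), (200.2, 50.0)],
        "OTHERPEPTIDER": [(150.0, 5.0), (300.0, 20.0)],
    }
    query = [(100.3, 9.0), (200.4, 55.0)]
    print(best_library_match(query, library))
```

Real library search engines add precursor mass filtering, peak preprocessing and statistical validation on top of a similarity measure of this kind.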

Spectral searching (2)

http://peptide.nist.gov/

PRIDE Cluster as a Public Data Mining Resource

http://www.ebi.ac.uk/pride/cluster

Spectral libraries for 16 species.

All clustering results, as well as specific subsets of interest, are available.

Source code (open source) and Java API.

Reprocess

Data are reprocessed with the intention of obtaining new knowledge or to provide an updated view on the results.

It mainly serves the same purpose as the original experiment.

For instance, a shotgun dataset can be reprocessed with a different algorithm or an updated sequence database.

Reprocessing repositories

These resources collect MS raw data and reprocess it using one given analysis pipeline and an up-to-date protein sequence database.

Main resources: GPMDB and PeptideAtlas (ISB, Seattle).

PeptideAtlas and GPMDB

Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014

Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.

They complement that data with exotic tissues.

Reprocessing for the validation of controversial data

Analysis of Tyrannosaurus rex fossils: controversial presence of collagen (is it a contamination of the sample? Did the sample contain any T. rex proteins at all?)

Asara et al. (2007) Science 316: 280-5.

Asara et al. (2007) Science 316: 1324-5.

Bern et al. (2009) JPR 9: 4328-32

PRIDE Archive assay accession 8633

Asara et al. reported the detection of collagen peptides in a 68-million-year-old Tyrannosaurus rex bone by shotgun proteomics. This finding has been called into question as a possible statistical artifact. We reanalyze Asara et al.'s tandem mass spectra using a different search engine and different statistical tools. Our reanalysis shows a sample containing common laboratory contaminants, soil bacteria, and bird-like hemoglobin and collagen.

Reprocessing for the validation of controversial data (2)

Info from R. Chalkley

Bromenshenk et al. (2011) PLOS One 5: e13181

Reprocessing for the validation of controversial data (3)

Experimental Protocol

Collected samples from healthy, collapsing and collapsed bee colonies.

Homogenised bees.

Digested with trypsin.

Analyzed by LC-MS/MS on LTQ.

Searched using Sequest.

Filtered results using PeptideProphet and ProteinProphet.

Performed further analysis to determine species statistically more commonly found in collapsing/collapsed colony samples.

Info from R. Chalkley

Bromenshenk et al. (2011) PLOS One 5: e13181

Reprocessing for the validation of controversial data (4)

Big pitfall: the search database was composed only of viral proteins. No bee proteins at all!

After re-searching the data, there is no evidence for viral peptides/proteins in any of their data. What the searches do match: honey bee, fruit fly, wasp, moth, human keratin, bacteria that like sugary environments.

"We believe that there is currently insufficient evidence to conclude that bees are a natural host for IIV-6, let alone that the virus is linked to CCD."

Info from R. Chalkley

Knudsen & Chalkley (2011) PLOS One 6: e20873

Foster (2011), MCP 10: M110.006387

Reprocessing for the validation of controversial data

Datasets PXD000561 and PXD000865 in PRIDE Archive

Various reanalyses of these datasets have been performed

Reanalysis of the Pandey dataset (Nature, 2014) by J. Choudhary's group at the Sanger Institute

Wright et al., Nat Commun, 2016

Dataset PXD000561

http://www.ebi.ac.uk/gxa

Repurposing

Data are considered in light of a question or a context that is different from the original study.

Proteogenomics studies

Discovery of novel PTMs.

Examples of repurposing datasets: proteogenomics

Data in public resources can be used for genome annotation purposes
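One way to picture this use for genome annotation is to map an identified peptide back onto a translated genomic sequence. The sketch below does this for the three forward reading frames only. It uses the standard genetic code, ignores the reverse strand and splicing, and its function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward reading frame; unknown codons become 'X'."""
    seq = dna.upper()[frame:]
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def locate_peptide(peptide: str, dna: str):
    """Return (frame, 0-based nucleotide start) for every forward-frame hit."""
    hits = []
    for frame in range(3):
        protein = translate(dna, frame)
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((frame, frame + 3 * pos))
            pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical genomic fragment and identified peptide.
    print(locate_peptide("MAK", "CCATGGCCAAATAG"))
```

A peptide that maps to a region with no annotated gene is the kind of evidence proteogenomics studies look for. Real pipelines search six frames or predicted gene models and control the false discovery rate.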

Repurposing: new PTMs found

Individual authors can reprocess raw data with new hypotheses in mind (not taken into account by the original authors).

Recent examples (using phosphoproteomics data sets):

O-GlcNAc-6-phosphate (1)

Phosphoglyceryl (2)

ADP-ribosylation (3)

(1) Hahne & Kuster, Mol Cell Proteomics (2012) 11 10 1063-9

(2) Moellering & Cravatt, Science (2013) 341 549-553

(3) Matic et al., Nat Methods (2012) 9 771-2

CompOmics Open Source Analysis Pipeline

Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9.

https://github.com/compomics/searchgui

Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015; 33(1):22-4.

https://github.com/compomics/peptide-shaker

Reshake PRIDE data!

Find the desired PRIDE project, inspect the project details, and start re-analyzing the data!

Public datasets from different omics: OmicsDI

http://www.ebi.ac.uk/Tools/omicsdi/

Aims to integrate omics datasets (proteomics, transcriptomics, metabolomics and genomics at present).

Proteomics: PRIDE, MassIVE, jPOST, PASSEL, GPMDB

Transcriptomics: ArrayExpress, Expression Atlas

Metabolomics: MetaboLights, Metabolomics Workbench, GNPS

Genomics: EGA

Perez-Riverol et al., Nat Biotechnol, in press

OmicsDI: Portal for omics datasets

PRIDE Proteomes provide an across-dataset and quality-filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.
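The idea of assessing PSMs through spectral clustering can be sketched roughly as follows: within a cluster of similar spectra, an identification is more trustworthy when most spectra agree on the same peptide. This is only an illustration of that idea, not the PRIDE Cluster implementation. The thresholds, function names and cluster contents are assumptions.

```python
from collections import Counter


def cluster_consensus(psm_peptides):
    """Return (most common peptide, fraction of spectra assigned to it)."""
    if not psm_peptides:
        raise ValueError("cluster contains no identified spectra")
    peptide, count = Counter(psm_peptides).most_common(1)[0]
    return peptide, count / len(psm_peptides)


def reliable_clusters(clusters, min_size=3, min_ratio=0.7):
    """Keep clusters that are large enough and dominated by one peptide.

    `clusters` maps a cluster id to the peptide sequences assigned to its
    spectra. `min_size` and `min_ratio` are illustrative thresholds only.
    Returns a dict: cluster id -> consensus peptide.
    """
    kept = {}
    for cluster_id, peptides in clusters.items():
        if len(peptides) < min_size:
            continue
        peptide, ratio = cluster_consensus(peptides)
        if ratio >= min_ratio:
            kept[cluster_id] = peptide
    return kept


if __name__ == "__main__":
    # Hypothetical clusters of spectra with their submitted identifications.
    clusters = {
        "c1": ["AAK", "AAK", "AAK", "GGR"],  # 75% agreement: kept
        "c2": ["AAK", "GGR"],                # too small: dropped
        "c3": ["LLK", "LLR", "LLE"],         # no consensus: dropped
    }
    print(reliable_clusters(clusters))
```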

Acknowledgements

http://www.ncbi.nlm.nih.gov/pubmed/26449181

http://onlinelibrary.wiley.com/doi/10.1002/pmic.201500295/epdf

Questions?