Proteomics public data resources: enabling "big data" analysis in proteomics

EMBL-EBI Now and in the Future

Dr. Juan Antonio Vizcaíno

EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK

Juan A. [email protected]

de.NBI Symposium
Heidelberg, 9 November 2016

Overview

Intro: Concept of Big data in biology and proteomics

PRIDE Archive and ProteomeXchange

PRIDE tools

Reuse of public proteomics data

Working with Big data: PRIDE Cluster

Big data: definition

Slide from: http://www.ibmbigdatahub.com/

Volume. The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety. The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

Velocity. In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth.

Veracity. The quality of captured data can vary greatly, affecting accurate analysis.

Big data in biology

The term has been applied so far mainly to genomics

One-slide intro to MS-based proteomics

Hein et al., Handbook of Systems Biology, 2012


Data resources at EMBL-EBI

Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Chemical biology: ChEMBL, ChEBI
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

This slide shows the core resources at EMBL-EBI, to give an idea of the range of data you can access through it.

What is a proteomics publication in 2016?

Proteomics studies generate potentially large amounts of data and results.

Ideally, a proteomics publication needs to:
Summarize the results of the study
Provide supporting information for reliability of any results reported

Information in a publication:
Manuscript
Supplementary material
Associated data submitted to a public repository


PRIDE (PRoteomics IDEntifications) Archive

http://www.ebi.ac.uk/pride/archive

Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

PRIDE stores mass spectrometry (MS)-based proteomics data:
Peptide and protein expression data (identification and quantification)
Post-translational modifications
Mass spectra (raw data and peak lists)
Technical and biological metadata
Any other related information

Full support for tandem MS approaches

ProteomeXchange: A Global, distributed proteomics database

PASSEL (SRM data)

PRIDE (MS/MS data)

MassIVE (MS/MS data)

Raw

ID/Q

Meta

jPOST (MS/MS data)

Mandatory raw data deposition since July 2015

Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

http://www.proteomexchange.org

New in 2016

Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press

ProteomeXchange data workflow

[Diagram: researchers' results, raw data and metadata are submitted to the receiving repositories. PRIDE, MassIVE and jPOST take MS/MS data (as complete submissions), PASSEL takes SRM data, and any other workflow is handled mainly as partial submissions. Datasets are announced through ProteomeCentral (metadata / manuscript, raw data, results) and reach journals, research groups (reanalysis of datasets) and PeptideAtlas (reprocessed results).]

Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press

ProteomeCentral: Centralised portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset
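As an illustration of how the portal address can be used, here is a minimal sketch that builds a dataset link from a PX accession. The base URL comes from the slide; the `ID` query parameter, the accession pattern (`PXD` plus six digits) and the function name `dataset_url` are assumptions made for this example and should be checked against the ProteomeCentral documentation.

```python
import re
from urllib.parse import urlencode

# Base URL as given on the slide.
PROTEOMECENTRAL_BASE = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: accessions look like "PXD" followed by six digits.
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def dataset_url(accession: str) -> str:
    """Return a ProteomeCentral link for one dataset accession.

    The "ID" query parameter is an assumption of this sketch.
    """
    cleaned = accession.strip().upper()
    if not PXD_PATTERN.match(cleaned):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return f"{PROTEOMECENTRAL_BASE}?{urlencode({'ID': cleaned})}"


if __name__ == "__main__":
    print(dataset_url("PXD001471"))
```

Validating the accession before building the link keeps malformed identifiers from silently turning into broken URLs.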


[Diagram: the ProteomeXchange data workflow, extended with downstream resources. Besides journals and the receiving repositories (PRIDE, PASSEL, MassIVE), datasets and reprocessed results feed UniProt/neXtProt, PeptideAtlas, GPMDB, proteomicsDB and other DBs. OmicsDI provides integration with other omics datasets.]

PRIDE: Source of MS proteomics data

PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas.

http://www.ebi.ac.uk/pride/archive



PRIDE Archive: over 4,500 datasets from over 51 countries and 1,700 groups

Datasets by country:
USA 814
Germany 528
UK 338
China 328
France 222
Netherlands 175
Canada 137

Data volume:
Total: ~275 TB
Number of all files: ~560,000
PXD000320-324: ~4 TB
PXD002319-26: ~2.4 TB
PXD001471: ~1.6 TB

1,973 datasets, i.e. 52% of all, are publicly accessible
~90% of all ProteomeXchange datasets

PRIDE Archive growth
[Chart: submissions per year, all submissions vs. complete submissions.]
In the last 12 months: ~165 submitted datasets per month

Top species studied by at least 100 datasets:
2,010 Homo sapiens
604 Mus musculus
191 Saccharomyces cerevisiae
140 Arabidopsis thaliana
127 Rattus norvegicus
>900 reported taxa in total

(> 922 processed by MaxQuant)
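The largest datasets are quoted as accession ranges in shorthand (PXD000320-324, PXD002319-26). A minimal sketch of how such shorthand could be expanded into individual accessions follows. The reading of the shorthand (the suffix replaces the last digits of the start accession) and the helper name `expand_accessions` are assumptions of this example.

```python
import re
from typing import List

# "PXD" + six digits, optionally followed by "-" and a shorter end suffix.
RANGE_PATTERN = re.compile(r"^(PXD)(\d{6})(?:-(\d+))?$")


def expand_accessions(spec: str) -> List[str]:
    """Expand 'PXD000320-324' into the individual accessions it covers."""
    match = RANGE_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognised accession or range: {spec!r}")
    prefix, start_digits, end_suffix = match.groups()
    start = int(start_digits)
    if end_suffix is None:
        return [f"{prefix}{start:06d}"]
    if len(end_suffix) > len(start_digits):
        raise ValueError(f"range suffix too long: {spec!r}")
    # Assumption: the suffix overwrites the trailing digits of the start number.
    kept = len(start_digits) - len(end_suffix)
    end = int(start_digits[:kept] + end_suffix)
    if end < start:
        raise ValueError(f"range end before start: {spec!r}")
    return [f"{prefix}{number:06d}" for number in range(start, end + 1)]


if __name__ == "__main__":
    for spec in ("PXD000320-324", "PXD002319-26", "PXD001471"):
        print(spec, "->", expand_accessions(spec))
```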



PRIDE tools



PRIDE Components: Data Submission Process

PRIDE Converter 2
PRIDE Inspector
PX Submission Tool

mzIdentML
PRIDE XML

In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.

Current PSI Standard File Formats for MS

PRIDE Inspector Toolsuite

Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016

PRIDE Inspector: standalone tool to enable visualisation and validation of MS data.
Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
Broad functionality.

https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
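To make the list of supported formats concrete, here is a small sketch that guesses the format of a file from its name. It is not how PRIDE Inspector itself detects formats. The extension table uses the conventional suffixes for these formats, and treating any `.xml` file as PRIDE XML is a simplifying assumption of this example.

```python
from pathlib import Path
from typing import Optional

# Conventional file suffixes (lower-cased). ".xml" is ambiguous in practice;
# mapping it to PRIDE XML is an assumption of this sketch.
FORMAT_BY_SUFFIX = {
    ".mzid": "mzIdentML",
    ".mzml": "mzML",
    ".mztab": "mzTab",
    ".xml": "PRIDE XML",
}

# Compression suffixes that may wrap any of the formats above.
COMPRESSION_SUFFIXES = {".gz", ".zip"}


def guess_format(filename: str) -> Optional[str]:
    """Return the likely format name, or None if the suffix is unknown."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    # Ignore trailing compression suffixes such as ".gz".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return FORMAT_BY_SUFFIX.get(suffixes[-1])


if __name__ == "__main__":
    for name in ("search.mzid", "run01.mzML.gz", "summary.mzTab", "peaks.mgf"):
        print(name, "->", guess_format(name))
```

A real tool would inspect the file content as well, since the suffix alone cannot tell PRIDE XML apart from other XML files.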

Summary and QC charts

Peptide spectra annotation and visualization



PX Submission Tool

Desktop application for data submissions to ProteomeXchange via PRIDE

Implemented in Java 7
Streamlines the submission process
Capture mappings between files
Retain metadata
Fast file transfer with Aspera (FASP transfer technology); FTP also available
Command line option
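The idea of capturing mappings between files can be sketched as a small data structure plus a consistency check. This is an illustration only, not the submission tool's actual model: the class `SubmissionFile`, the type labels `"raw"` and `"result"`, and the rule that every result file should map to at least one raw file are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubmissionFile:
    """One file in a submission (hypothetical model for illustration)."""
    file_id: int
    file_type: str  # hypothetical labels: "raw" or "result"
    path: str
    mapped_ids: List[int] = field(default_factory=list)


def unmapped_results(files: List[SubmissionFile]) -> List[str]:
    """Return paths of result files not mapped to any raw file."""
    raw_ids = {f.file_id for f in files if f.file_type == "raw"}
    return [
        f.path
        for f in files
        if f.file_type == "result" and not raw_ids.intersection(f.mapped_ids)
    ]


if __name__ == "__main__":
    submission = [
        SubmissionFile(1, "raw", "run01.raw"),
        SubmissionFile(2, "raw", "run02.raw"),
        SubmissionFile(3, "result", "run01.mzid", mapped_ids=[1]),
        SubmissionFile(4, "result", "run02.mzid"),  # mapping forgotten
    ]
    print(unmapped_results(submission))
```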

Submission tool screenshot





Datasets are being reused more and more.

Data download volume for PRIDE Archive in 2015: 198 TB

Vaudel et al., Proteomics, 2016

Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014

Two independent groups claimed to have produced the first complete draft of the human proteome by MS.

Some of their findings are controversial and need further validation but generated a lot of discussion and put proteomics in the spotlight.

They used many different tissues.

Nature cover 29 May 2014

Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014

Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.

They complement that data with exotic tissues.

Data sharing in Proteomics

Examples of repurposing in proteogenomics

Challenges for data reuse in proteomics

Insufficient technical and biological metadata.

Large computational infrastructure may be needed (e.g. when analysing many datasets together).

Shortage of expertise (people).

Lack of standardisation in the field.

Summary of the talk so far

PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field.
Data sharing is becoming the norm in the field.

Standalone tools: PRIDE Inspector and PX Submission tool.

Datasets are increasingly reused (many opportunities):
Example of one of the drafts of the human proteome.
Proteogenomics approaches.
But there are important challenges as well.

PRIDE Cluster: Initial Motivation

Provide a QC-filtered peptide-centric view of PRIDE.

Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).

Heterogeneous quality, difficult to make the data comparable.

Enable assessment of (published) proteomics data.

PRIDE Cluster

Provide an aggregated peptide-centric view of PRIDE Archive.
Hypothesis: same peptide will generate similar MS/MS spectra across experiments.
Enables QC of peptide-spectrum matches (PSMs). Infer reliable identifications by comparing submitted identifications of spectra within a cluster.

After clustering, a representative spectrum is built for all peptides consistently identified across different datasets.

Griss et al., Nat. Methods, 2013
Griss et al., Nat. Methods, 2016
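The hypothesis that the same peptide yields similar MS/MS spectra needs a similarity measure. The sketch below uses a plain cosine similarity on binned peak intensities, a common textbook choice that is swapped in here for illustration. It is not claimed to be the scoring function used by PRIDE Cluster, and the bin width and example peaks are made up.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a: List[Peak], b: List[Peak], bin_width: float = 1.0) -> float:
    """Cosine of the angle between two binned spectra (0 = unrelated, 1 = identical shape)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up peak lists for illustration only.
    spectrum_1 = [(175.1, 40.0), (304.2, 100.0), (417.3, 65.0)]
    spectrum_2 = [(175.2, 35.0), (304.1, 90.0), (417.4, 70.0)]
    spectrum_3 = [(200.5, 80.0), (350.9, 20.0)]
    print(cosine_similarity(spectrum_1, spectrum_2))
    print(cosine_similarity(spectrum_1, spectrum_3))
```

Spectra with high pairwise similarity would be candidates for the same cluster, whatever identification was originally submitted for them.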



PRIDE Cluster - Concept

[Diagram: originally submitted identified spectra, most of them identified as NMMAACDPR and one as PPECPDFDPPR, go through spectrum clustering. The resulting cluster has a consensus spectrum assigned to NMMAACDPR.]

Threshold: At least 3 spectra in a cluster and ratio >70%.
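The threshold on this slide can be sketched as a small function. The two constants come from the slide (at least 3 spectra, ratio above 70%). Reading "ratio" as the share of the most frequent peptide sequence among the spectra of the cluster is an assumption of this example, as is the function name.

```python
from collections import Counter
from typing import List, Optional

MIN_SPECTRA = 3    # from the slide: at least 3 spectra in a cluster
MIN_RATIO = 0.70   # from the slide: ratio > 70%


def reliable_identification(sequences: List[str]) -> Optional[str]:
    """Return the cluster's peptide if it passes the threshold, else None.

    `sequences` holds the submitted peptide identification of each spectrum.
    """
    if len(sequences) < MIN_SPECTRA:
        return None
    top_sequence, count = Counter(sequences).most_common(1)[0]
    ratio = count / len(sequences)
    return top_sequence if ratio > MIN_RATIO else None


if __name__ == "__main__":
    cluster = ["NMMAACDPR"] * 4 + ["PPECPDFDPPR"]
    print(reliable_identification(cluster))  # 4 of 5 spectra agree (ratio 0.8)
```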

PRIDE Cluster: Implementation

Griss et al., Nat. Methods, 2013

Clustered all public, identified spectra in PRIDE
EBI compute farm, LSF
20.7 M identified spectra
610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets

PRIDE Cluster Iteration 2: Why?

PRIDE Archive has experienced a huge increase in data since 2013.
We wanted to develop an algorithm that could also work with unidentified spectra.

[Chart: PRIDE Archive growth, submissions per year, all submissions vs. complete submissions.]

Parallelizing Spectrum Clustering: Hadoop

Optimizes work distribution among machines.
Hadoop is an open source framework for parallelism based on the MapReduce programming model introduced by Google.
Solves many general issues of large parallel jobs:
Scheduling
Inter-job communication
Failure

https://hadoop.apache.org/
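The MapReduce idea can be shown without Hadoop itself. The sketch below simulates the map, shuffle and reduce steps in plain Python on a single machine. Using the precursor m/z bin as the key, so that only spectra with similar precursor mass end up in the same reducer, is an assumption about how spectrum clustering work might be partitioned, not a description of the PRIDE Cluster code. The spectrum records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Spectrum = Dict[str, object]  # hypothetical record: {"id": str, "precursor_mz": float}


def map_phase(spectrum: Spectrum, bin_width: float = 1.0) -> Iterator[Tuple[int, str]]:
    """Map: emit (precursor m/z bin, spectrum id) for one spectrum."""
    yield int(float(spectrum["precursor_mz"]) // bin_width), str(spectrum["id"])


def shuffle(pairs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: int, values: List[str]) -> Tuple[int, List[str]]:
    """Reduce: process one group; a real job would cluster the spectra here."""
    return key, sorted(values)


def run_job(spectra: List[Spectrum]) -> Dict[int, List[str]]:
    pairs = (pair for spectrum in spectra for pair in map_phase(spectrum))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    spectra = [
        {"id": "s1", "precursor_mz": 500.2},
        {"id": "s2", "precursor_mz": 500.7},
        {"id": "s3", "precursor_mz": 612.3},
    ]
    print(run_job(spectra))
```

Because each key can be reduced independently, the groups can be spread over many machines, which is the part Hadoop schedules and retries on failure.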

PRIDE Cluster: Second Implementation

First implementation (Griss et al., Nat. Methods, 2013):
Clustered all public, identified spectra in PRIDE
EBI compute farm, LSF
20.7 M identified spectra
610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets

Second implementation (Griss et al., Nat. Methods, 2016):
Clustered all public spectra in PRIDE by April 2015
Apache Hadoop
Starting with 256 M spectra:
190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
66 M identified spectra
Result: 28 M clusters
5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
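A quick sanity check of the figures on this slide, written out as arithmetic. All input numbers are from the slide. Two points are assumptions of this sketch: that the clustered input is the 111 M filtered unidentified spectra plus the 66 M identified ones, and that 340 cores for 5 days is treated as an upper bound on the compute actually used.

```python
# All counts in millions of spectra or clusters, as quoted on the slide.
total_spectra = 256
unidentified = 190
identified = 66
unidentified_after_filter = 111
clusters = 28

# The two groups add up to the stated starting total.
assert unidentified + identified == total_spectra

# Assumption: only filtered unidentified plus identified spectra were clustered.
clustered_spectra = unidentified_after_filter + identified
average_cluster_size = clustered_spectra / clusters

# Upper bound on compute for the second run: every core busy for all 5 days.
core_days_upper_bound = 340 * 5

# First implementation, for comparison: 20.7 M spectra in 610 CPU days.
first_run_spectra_per_cpu_day = 20.7e6 / 610
second_run_spectra_per_core_day = clustered_spectra * 1e6 / core_days_upper_bound

print(f"clustered spectra: {clustered_spectra} M")
print(f"average cluster size: {average_cluster_size:.1f} spectra")
print(f"first run: {first_run_spectra_per_cpu_day:,.0f} spectra per CPU day")
print(f"second run: at least {second_run_spectra_per_core_day:,.0f} spectra per core day")
```

Under these assumptions the second run handles roughly three times as many spectra per unit of compute, in addition to processing a much larger input.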

Examples: one perfect cluster

880 PSMs give the same peptide ID
4 species
28 datasets
Same instruments

Examples: one perfect cluster (2)

PRIDE Cluster

Sequence-based search engines
Spectrum clustering
Incorrectly or unidentified spectra

Output of the analysis

1. Inconsistent spectrum clusters

2. Clusters including identified and unidentified spectra.

3. Clusters just containing unidentified spectra.

1. Re-analysis of inconsistent clusters

[Diagram: originally submitted identified spectra with conflicting identifications (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) go through spectrum clustering into one cluster with a consensus spectrum.]

No sequence has a proportion in the cluster >50%.

Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.
453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
In this case, it is likely that a contaminants DB was not used in the search.

Validation

2. Inferring identifications for originally unidentified spectra

9.1 M unidentified spectra were contained in clusters with a reliable identification.
These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
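The inference step can be sketched as follows: if the identified spectra in a cluster agree reliably on one peptide, that peptide becomes a candidate identification for the unidentified spectra in the same cluster. The default thresholds reuse the values quoted earlier in the talk (at least 3 spectra, ratio above 70%). Applying them to the identified spectra only, and the function name, are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Optional


def infer_identifications(
    cluster: Dict[str, Optional[str]],
    min_spectra: int = 3,
    min_ratio: float = 0.70,
) -> Dict[str, str]:
    """Map unidentified spectrum ids to the cluster's reliable peptide.

    `cluster` maps spectrum id -> submitted peptide, or None if unidentified.
    Returns an empty dict when the cluster has no reliable identification.
    """
    identified = [peptide for peptide in cluster.values() if peptide is not None]
    if len(identified) < min_spectra:
        return {}
    top_peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    # Candidates only: these still need to be confirmed.
    return {sid: top_peptide for sid, peptide in cluster.items() if peptide is None}


if __name__ == "__main__":
    cluster = {
        "spec_a": "NMMAACDPR",
        "spec_b": "NMMAACDPR",
        "spec_c": "NMMAACDPR",
        "spec_d": None,
        "spec_e": None,
    }
    print(infer_identifications(cluster))
```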

3. Consistently unidentified clusters

19 M clusters contain only unidentified spectra.

41,155 of these clusters have more than 100 spectra (= 12 M spectra).

Most of them are likely to be derived from peptides.

They could correspond to PTMs or variant peptides.

With various methods, we found likely identifications for about 20%.

Vast amount of data mining remains to be done.

PRIDE Cluster as a Public Data Mining Resource

http://www.ebi.ac.uk/pride/cluster

Spectral libraries for 16 species.
All clustering results, as well as specific subsets of interest available.
Source code (open source) and Java API

PRIDE Proteomes provide an across-dataset and quality-filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.

Public datasets from different omics: OmicsDI

http://www.ebi.ac.uk/Tools/omicsdi/

Aims to integrate omics datasets (proteomics, transcriptomics, metabolomics and genomics at present).

PRIDE, MassIVE, jPOST, PASSEL, GPMDB

ArrayExpress, Expression Atlas

MetaboLights, Metabolomics Workbench, GNPS

EGA

Perez-Riverol et al., 2016, bioRxiv

OmicsDI: Portal for omics datasets






Summary part 2

Using a big data approach we were able to get extra knowledge from all the public data in PRIDE Archive.

Spectrum clustering enables QC in proteomics resources such as PRIDE Archive.

It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).

OmicsDI: new platform to identify public datasets coming from different omics technologies (more possibilities for data reuse!)

Acknowledgements: People

Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)

Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak

Enrique Perez

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

Acknowledgements: The PRIDE Team

All data submitters !!!

@pride_ebi
@proteomexchange


Questions?

http://www.slideshare.net/JuanAntonioVizcaino

