EMBL-EBI Now and in the Future
Proteomics public data resources: enabling big data analysis in proteomics
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. [email protected] de.NBI Symposium, Heidelberg, 9 November 2016
Overview
Intro: Concept of Big data in biology and proteomics
PRIDE Archive and ProteomeXchange
PRIDE tools
Reuse of public proteomics data
Working with Big data: PRIDE Cluster
Big data: definition
Slide from: http://www.ibmbigdatahub.com/
Volume. The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety. The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
Velocity. In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Veracity. The quality of captured data can vary greatly, affecting accurate analysis.
Big data in biology
The term has been applied so far mainly to genomics
One-slide intro to MS-based proteomics
Hein et al., Handbook of Systems Biology, 2012
Data resources at EMBL-EBI
Genes, genomes & variation: Ensembl, Ensembl Genomes, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, GWAS Catalog, Metagenomics portal
Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Chemical biology: ChEMBL, ChEBI
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
This slide shows the core resources at EMBL-EBI and the range of data you can access through them.
What is a proteomics publication in 2016?
Proteomics studies generate potentially large amounts of data and results.
Ideally, a proteomics publication needs to:
- Summarize the results of the study
- Provide supporting information for reliability of any results reported
Information in a publication:
- Manuscript
- Supplementary material
- Associated data submitted to a public repository
PRIDE stores mass spectrometry (MS)-based proteomics data:
- Peptide and protein expression data (identification and quantification)
- Post-translational modifications
- Mass spectra (raw data and peak lists)
- Technical and biological metadata
- Any other related information
Full support for tandem MS approaches
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange: A Global, distributed proteomics database
PASSEL (SRM data)
PRIDE (MS/MS data)
MassIVE (MS/MS data)
jPOST (MS/MS data)
Data types: raw data (Raw), identification/quantification results (ID/Q), metadata (Meta)
Mandatory raw data deposition since July 2015
Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
New in 2016
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
ProteomeXchange data workflow
[Diagram. Labels: researchers' results, raw data and metadata; DATASETS; receiving repositories PRIDE, MassIVE and jPOST (MS/MS data as complete submissions; any other workflow, mainly partial submissions) and PASSEL (SRM data); ProteomeCentral (metadata / manuscript, raw data, results); journals; research groups; reanalysis of datasets and reprocessed results (PeptideAtlas, MassIVE).]
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
ProteomeCentral: Centralised portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
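A minimal sketch of how a single dataset page in ProteomeCentral could be addressed from its accession. The base URL is the one on the slide; the `ID` query parameter, the "PXD plus six digits" accession pattern and the helper name `dataset_url` are assumptions made for illustration and are not stated in the talk.

```python
import re
from urllib.parse import urlencode

# Base URL as given on the slide.
BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: accessions look like "PXD" followed by six digits.
ACCESSION_PATTERN = re.compile(r"^PXD\d{6}$")


def dataset_url(accession: str) -> str:
    """Return the ProteomeCentral page URL for one dataset accession."""
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    # Assumption: the dataset is selected with an "ID" query parameter.
    return f"{BASE_URL}?{urlencode({'ID': accession})}"


if __name__ == "__main__":
    # PXD000320 is one of the accessions mentioned later in the talk.
    print(dataset_url("PXD000320"))
```

The validation step is there only to catch typos before a request is made; no network access happens in this sketch.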
ProteomeXchange data workflow (with downstream resources)
[Diagram. Labels: researchers' results, raw data and metadata; DATASETS; receiving repositories PRIDE, MassIVE and jPOST (MS/MS data as complete submissions; any other workflow, mainly partial submissions) and PASSEL (SRM data); ProteomeCentral (metadata / manuscript, raw data, results); journals; research groups; reanalysis of datasets and reprocessed results (PeptideAtlas, MassIVE, GPMDB, proteomicsDB); UniProt/neXtProt and other DBs; OmicsDI (integration with other omics datasets).]
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
PRIDE: Source of MS proteomics data
PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas.
http://www.ebi.ac.uk/pride/archive
PRIDE Archive: over 4,500 datasets from over 51 countries and 1,700 groups
Datasets by country:
- USA: 814
- Germany: 528
- UK: 338
- China: 328
- France: 222
- Netherlands: 175
- Canada: 137
Data volume:
- Total: ~275 TB
- Number of all files: ~560,000
- PXD000320-324: ~4 TB
- PXD002319-26: ~2.4 TB
- PXD001471: ~1.6 TB
1,973 datasets, i.e. 52% of all, are publicly accessible
~90% of all ProteomeXchange datasets
[Chart: PRIDE Archive growth, submissions per year (all submissions and complete submissions)]
In the last 12 months: ~165 submitted datasets per month
Top species studied by at least 100 datasets:
- Homo sapiens: 2,010
- Mus musculus: 604
- Saccharomyces cerevisiae: 191
- Arabidopsis thaliana: 140
- Rattus norvegicus: 127
>900 reported taxa in total
(> 922 processed by MaxQuant)
PRIDE Components: Data Submission Process
PRIDE Converter 2
PRIDE Inspector
PX Submission Tool
mzIdentML
PRIDE XML
In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.
Current PSI Standard File Formats for MS
PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016
PRIDE Inspector - standalone tool to enable visualisation and validation of MS data.
Built on top of ms-data-core-api - open source algorithms and libraries for computational proteomics.
Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
Summary and QC charts
Peptide spectra annotation and visualization
PX Submission Tool
Desktop application for data submissions to ProteomeXchange via PRIDE
Implemented in Java 7
Streamlines the submission process
Capture mappings between files
Retain metadata
Fast file transfer with Aspera (FASP transfer technology); FTP also available
Command line option
Submission tool screenshot
Datasets are being reused more and more.
Vaudel et al., Proteomics, 2016
Data download volume for PRIDE Archive in 2015: 198 TB
Data sharing in Proteomics
Vaudel et al., Proteomics, 2016
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014
Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
Some of their findings are controversial and need further validation but generated a lot of discussion and put proteomics in the spotlight.
They used many different tissues.
Nature cover 29 May 2014
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
Around 60% of the data used for the analysis comes from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
They complement that data with exotic tissues.
Examples of repurposing in proteogenomics
Challenges for data reuse in proteomics
Insufficient technical and biological metadata.
Large computational infrastructure may be needed (e.g. when analysing many datasets together).
Shortage of expertise (people).
Lack of standardisation in the field.
Summary of the talk so far
PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field.
Data sharing is becoming the norm in the field.
Standalone tools: PRIDE Inspector and PX Submission tool.
Datasets are increasingly reused (many opportunities):
- Example of one of the drafts of the human proteome.
- Proteogenomics approaches.
But there are important challenges as well.
PRIDE Cluster: Initial Motivation
Provide a QC-filtered peptide-centric view of PRIDE.
Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
Heterogeneous quality, difficult to make the data comparable.
Enable assessment of (published) proteomics data.
PRIDE Cluster
Provide an aggregated peptide-centric view of PRIDE Archive.
Hypothesis: the same peptide will generate similar MS/MS spectra across experiments.
Enables QC of peptide-spectrum matches (PSMs).
Infer reliable identifications by comparing submitted identifications of spectra within a cluster.
After clustering, a representative spectrum is built for all peptides consistently identified across different datasets.
Griss et al., Nat. Methods, 2013
Griss et al., Nat. Methods, 2016
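The hypothesis that the same peptide yields similar MS/MS spectra can be illustrated with a simple similarity score. The sketch below uses cosine similarity over fixed-width m/z bins, which is a common choice assumed here for illustration; the talk does not state which metric PRIDE Cluster uses, and the names `bin_spectrum`, `cosine_similarity` and the 1.0 bin width are hypothetical.

```python
import math
from collections import defaultdict

Peak = tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: list[Peak], bin_width: float = 1.0) -> dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a: list[Peak], b: list[Peak], bin_width: float = 1.0) -> float:
    """Cosine of the angle between two binned spectra (0.0 if either is empty)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up peak lists: the second is the first with doubled intensities.
    spectrum_1 = [(175.1, 100.0), (262.1, 50.0), (375.2, 80.0)]
    spectrum_2 = [(175.1, 200.0), (262.1, 100.0), (375.2, 160.0)]
    spectrum_3 = [(120.0, 10.0), (500.3, 90.0)]
    print(cosine_similarity(spectrum_1, spectrum_2))  # close to 1: same peak pattern
    print(cosine_similarity(spectrum_1, spectrum_3))  # 0.0: no shared bins
```

Because the score depends only on the relative peak pattern, two acquisitions of the same peptide with different absolute intensities still score as similar.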
PRIDE Cluster - Concept
[Figure: originally submitted identified spectra (peptides NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and a consensus spectrum is built per cluster.]
Threshold: at least 3 spectra in a cluster and ratio >70%.
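The threshold on this slide can be sketched as a small check. The numbers (at least 3 spectra, ratio >70%) come from the slide; reading "ratio" as the share of spectra in the cluster that carry the most frequent peptide identification is an assumption, and the function names are hypothetical.

```python
from collections import Counter

MIN_SPECTRA = 3   # "at least 3 spectra in a cluster"
MIN_RATIO = 0.70  # "ratio >70%"


def dominant_identification(peptides: list[str]) -> tuple[str, float]:
    """Return the most frequent peptide and its share of the cluster."""
    peptide, count = Counter(peptides).most_common(1)[0]
    return peptide, count / len(peptides)


def is_reliable(peptides: list[str]) -> bool:
    """Apply the slide's threshold to the peptide IDs of one cluster."""
    if len(peptides) < MIN_SPECTRA:
        return False
    # Assumption: "ratio" is the share of the most frequent identification.
    _, ratio = dominant_identification(peptides)
    return ratio > MIN_RATIO


if __name__ == "__main__":
    cluster = ["NMMAACDPR"] * 4 + ["PPECPDFDPPR"]
    print(dominant_identification(cluster), is_reliable(cluster))
```

With four of five spectra agreeing, the example cluster has a ratio of 0.8 and passes; a cluster of two spectra fails on size alone, whatever its ratio.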
PRIDE Cluster: Implementation
Griss et al., Nat. Methods, 2013
Clustered all public, identified spectra in PRIDE
- EBI compute farm, LSF
- 20.7 M identified spectra
- 610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets
PRIDE Cluster Iteration 2: Why?
PRIDE Archive has experienced a huge increase in data since 2013.
We wanted to develop an algorithm that could also work with unidentified spectra.
[Chart: PRIDE Archive growth, submissions per year (all submissions and complete submissions)]
Parallelizing Spectrum Clustering: Hadoop
Optimizes work distribution among machines.
Hadoop is an open-source framework for parallel processing based on the MapReduce programming model introduced by Google.
Solves many general issues of large parallel jobs:
- Scheduling
- Inter-job communication
- Failure
https://hadoop.apache.org/
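To make the map-reduce idea concrete, here is a single-process sketch of the three phases in plain Python. It does not use the Hadoop API, and partitioning spectra by precursor m/z bin is an assumption chosen for illustration; the talk does not describe how PRIDE Cluster partitions its input. The record layout and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical record layout: {"id": str, "precursor_mz": float}
Spectrum = dict


def map_phase(spectra: Iterable[Spectrum], bin_width: float = 1.0) -> Iterator[tuple[int, Spectrum]]:
    """Map: emit (key, value) pairs, keyed by precursor m/z bin."""
    for spectrum in spectra:
        yield int(spectrum["precursor_mz"] // bin_width), spectrum


def shuffle(pairs: Iterable[tuple[int, Spectrum]]) -> dict[int, list[Spectrum]]:
    """Shuffle: group values by key (a framework does this across machines)."""
    groups: dict[int, list[Spectrum]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: int, spectra: list[Spectrum]) -> tuple[int, list[str]]:
    """Reduce: process one group independently of all others.

    Placeholder for a real clustering step: it only collects the IDs.
    """
    return key, sorted(s["id"] for s in spectra)


def run_job(spectra: Iterable[Spectrum], bin_width: float = 1.0) -> dict[int, list[str]]:
    groups = shuffle(map_phase(spectra, bin_width))
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    demo = [
        {"id": "s1", "precursor_mz": 500.2},
        {"id": "s2", "precursor_mz": 500.7},
        {"id": "s3", "precursor_mz": 742.4},
    ]
    print(run_job(demo))
```

Each reduce call sees only its own group, which is what lets a framework run the groups on different machines and restart a failed group without redoing the rest.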
PRIDE Cluster: Second Implementation
Griss et al., Nat. Methods, 2013
Clustered all public, identified spectra in PRIDE
- EBI compute farm, LSF
- 20.7 M identified spectra
- 610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets
Griss et al., Nat. Methods, 2016
Clustered all public spectra in PRIDE by April 2015
- Apache Hadoop
- Starting with 256 M spectra
- 190 M unidentified spectra (they were filtered to 111 M for spectra that are likely to represent a peptide)
- 66 M identified spectra
- Result: 28 M clusters
- 5 calendar days on 30 node Hadoop cluster, 340 CPU cores
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
Examples: one perfect cluster (2)
PRIDE Cluster
[Diagram labels: sequence-based search engines; spectrum clustering; incorrectly or unidentified spectra]
Output of the analysis
1. Inconsistent spectrum clusters
2. Clusters including identified and unidentified spectra.
3. Clusters just containing unidentified spectra.
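The three outputs can be sketched as a classification over the peptide labels of one cluster, with `None` standing for an unidentified spectrum. The 50% cut-off for "inconsistent" follows the rule stated on the re-analysis slide ("No sequence has a proportion in the cluster >50%"); computing that proportion over the identified spectra only, the fourth "consistent" label and the function name are assumptions for illustration.

```python
from collections import Counter
from typing import Optional


def classify_cluster(peptides: list[Optional[str]]) -> str:
    """Assign one cluster to an output category.

    peptides holds one entry per spectrum; None means unidentified.
    """
    if not peptides:
        raise ValueError("a cluster needs at least one spectrum")
    identified = [p for p in peptides if p is not None]
    if not identified:
        return "unidentified_only"  # output 3
    top_count = Counter(identified).most_common(1)[0][1]
    # Assumption: the proportion is taken over identified spectra only.
    if top_count / len(identified) <= 0.5:
        return "inconsistent"  # output 1
    if len(identified) < len(peptides):
        return "identified_and_unidentified"  # output 2
    return "consistent"  # not one of the three outputs listed on the slide


if __name__ == "__main__":
    print(classify_cluster(["NMMAACDPR", "NMMAACDPR", None]))
    print(classify_cluster(["NMMAACDPR", "IGGIGTVPVGR", "PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"]))
    print(classify_cluster([None, None, None]))
```

The inconsistency check runs before the mixed check, so a cluster with conflicting identifications is reported as inconsistent even if it also contains unidentified spectra.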
1. Re-analysis of inconsistent clusters
[Figure: spectrum clustering groups originally submitted identified spectra with conflicting peptide identifications (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) into one cluster with a single consensus spectrum.]
No sequence has a proportion in the cluster >50%
1. Re-analysis of inconsistent clusters
Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.
453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
In this case, it is likely that a contaminants DB was not used in the search.
Validation
2. Inferring identifications for originally unidentified spectra
9.1 M unidentified spectra were contained in clusters with a reliable identification.
These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
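A minimal sketch of how candidate identifications could be proposed for unidentified spectra in a reliable cluster. It reuses the threshold from the concept slide (at least 3 spectra, ratio >70%); taking the ratio over the identified spectra only is an assumption, as are the input layout and the function name. As the slide says, the results are candidates that still need to be confirmed.

```python
from collections import Counter
from typing import Optional


def candidate_identifications(
    cluster: dict[str, Optional[str]],
    min_spectra: int = 3,
    min_ratio: float = 0.70,
) -> dict[str, str]:
    """Map unidentified spectrum IDs to the cluster's dominant peptide.

    cluster maps spectrum ID -> peptide sequence, or None if unidentified.
    Returns an empty dict when the cluster is not considered reliable.
    """
    identified = [p for p in cluster.values() if p is not None]
    if len(cluster) < min_spectra or not identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    # Assumption: the ratio is computed over identified spectra only.
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, p in cluster.items() if p is None}


if __name__ == "__main__":
    demo = {"s1": "NMMAACDPR", "s2": "NMMAACDPR", "s3": "NMMAACDPR", "s4": None}
    print(candidate_identifications(demo))
```

Returning only the previously unidentified spectra keeps the submitted identifications untouched, in line with the archive storing data as originally analysed.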
3. Consistently unidentified clusters
19 M clusters contain only unidentified spectra.
41,155 of these clusters have more than 100 spectra (= 12 M spectra).
Most of them are likely to be derived from peptides.
They could correspond to PTMs or variant peptides.
With various methods, we found likely identifications for about 20%.
Vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
http://www.ebi.ac.uk/pride/cluster
Spectral libraries for 16 species.
All clustering results, as well as specific subsets of interest, available.
Source code (open source) and Java API
Public datasets from different omics: OmicsDI
http://www.ebi.ac.uk/Tools/omicsdi/
Aims to integrate omics datasets (proteomics, transcriptomics, metabolomics and genomics at present).
Proteomics: PRIDE, MassIVE, jPOST, PASSEL, GPMDB
Transcriptomics: ArrayExpress, Expression Atlas
Metabolomics: MetaboLights, Metabolomics Workbench, GNPS
Genomics: EGA
Perez-Riverol et al., 2016, bioRxiv
PRIDE Proteomes provide an across-dataset and quality-filtered view on PRIDE Archive data. Good PSMs are assessed using the PRIDE Cluster approach, based on spectral clustering.
OmicsDI: Portal for omics datasets
Summary part 2
Using a big data approach we were able to get extra knowledge from all the public data in PRIDE Archive.
Spectrum clustering enables QC in proteomics resources such as PRIDE Archive.
It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).
OmicsDI: new platform to identify public datasets coming from different omics technologies (more possibilities for data reuse!)
Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Enrique Perez
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange
Questions?
http://www.slideshare.net/JuanAntonioVizcaino