The ppt version of a talk I gave at the Biocuration 2012 Conference in Washington DC at Georgetown University in front of ~300 people.
PRIDE: Quality control in a proteomics data repository
Attila Csordas
Proteomics Services Team
Biocuration Conference
April 2nd, 2012
Overview
who are we?
what are we dealing with?
manual curation and submission
quick detour: ProteomeXchange
automated curation & submission pipeline
conclusion
The PRoteomics IDEntifications (PRIDE) database is a centralised, primary, archival, public data repository for MS/MS proteomics data. It contains peptide identifications, protein identifications, mass spectra, protein expression values and metadata.
PRIDE: http://www.ebi.ac.uk/pride
Acknowledgements
@pride_ebi
[email protected]@ebi.ac.uk
http://code.google.com/p/pride-toolsuite/
http://code.google.com/p/pride-converter-2/
colleagues at the PRIDE team
Mass spectrometry
analytical technique measuring the mass-to-charge (m/z) ratio of charged particles to determine masses of particles, composition of samples/molecules and chemical structures of molecules
Shotgun/bottom-up proteomics
[Diagram labels: proteins, peptides, MS analysis, fragmentation, MS/MS analysis, sequence database, protocol]
What is a PRIDE submission?
Growth of core data types
130 million
23 million
4.6 million
Manual curation and submission process
Search engine output + spectra
Mascot (.dat), X!Tandem (.xml) + MGF
PRIDE Converter
PRIDE XML
PRIDE Inspector
more flexible than web interface
initial assessment on data quality
visualise/check data
summary charts
support for submitters & reviewers/editors
Frequent Data Quality Issues
1. syntactic problems, e.g. <SearchEngine>PeptideShaker</SearchEngine><PeptideItem>
2a. core data missing, e.g. no protein/peptide identifications
2b. or metadata missing, e.g. no species
3. inconsistent/incorrect data, e.g. protein modifications
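The first two issue types lend themselves to a cheap automated check. The sketch below is only an illustration, not the PRIDE team's tooling: it tests whether a file is well-formed XML and counts `PeptideItem` elements (the tag name shown on the slide). The function name `check_submission_xml` and the `Experiment` wrapper element in the usage lines are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def check_submission_xml(xml_text):
    """Return (is_well_formed, n_peptide_items, error_message).

    Hypothetical helper: it only checks XML well-formedness and counts
    PeptideItem elements; it does not validate against any schema.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Issue type 1: syntactic problem (e.g. an unclosed tag).
        return False, 0, str(err)
    # Issue type 2a shows up as a count of zero.
    n_peptide_items = sum(1 for _ in root.iter("PeptideItem"))
    return True, n_peptide_items, ""


# Usage with toy documents (the Experiment wrapper is an assumption).
broken = "<Experiment><SearchEngine>PeptideShaker</SearchEngine><PeptideItem></Experiment>"
empty = "<Experiment><SearchEngine>PeptideShaker</SearchEngine></Experiment>"
print(check_submission_xml(broken)[0])
print(check_submission_xml(empty)[1])
```

A real check would also have to look at metadata such as species, which depends on the actual file schema and is left out here.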
Delta m/z of detected peptide precursors
experimental precursor ion m/z - theoretical precursor ion m/z
source of delta m/z outliers: incorrect or missing protein modifications and charge state misassignments
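As a rough illustration of the delta m/z definition above, the following sketch computes the theoretical precursor m/z from a peptide's neutral monoisotopic mass and charge, subtracts it from the experimental value, and reports the share of outliers. The function names and the 0.5 m/z tolerance are assumptions for the example, not values from the talk.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def theoretical_mz(neutral_mass, charge):
    """m/z of a peptide with the given neutral mass carrying `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def delta_mz(experimental_mz, neutral_mass, charge):
    """Experimental precursor m/z minus theoretical precursor m/z."""
    return experimental_mz - theoretical_mz(neutral_mass, charge)


def percent_outliers(deltas, tolerance=0.5):
    """Percentage of deltas whose absolute value exceeds `tolerance`.

    The default tolerance is an arbitrary illustrative value.
    """
    if not deltas:
        return 0.0
    n_out = sum(1 for d in deltas if abs(d) > tolerance)
    return 100.0 * n_out / len(deltas)


# Toy case: a doubly charged peptide reported WITHOUT its oxidation
# (+15.9949 Da). The measured ion includes the oxidation, so the delta
# comes out at about 15.9949 / 2 m/z units.
reported_mass = 1000.0
measured = theoretical_mz(reported_mass + 15.9949, charge=2)
print(delta_mz(measured, reported_mass, charge=2))
```

The toy case shows why a missing modification appears as a systematic offset, and why a wrong charge state would shift the theoretical value as well.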
Fixing modifications based on delta m/z outliers
but the manual approach does not scale!
10 times as many, and 10 times as big, submissions per day?
single point of submission of data to the main repositories to encourage data exchange
[Diagram labels: Individual submissions; Large-scale submissions; EBI PRIDE; Raw files archive; PeptideAtlas; Users; Published, Raw, Reprocessed; UniProt; Other DBs (GPMDB, …)]
PX submission pipeline
Steps shown: PX Tool, Validation, Submission, Publication, ProteomeCentral
Files: Raw Files, PRIDE XML, Summary
Automated regular submission pipeline
curation and submission take ~1/6 of the time needed for the manual process
actionable curation summary
Number of files: 3
Project: Combined personal saliva proteome and microbioproteome
XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT
Filename: 22143.xml
Size: 3.3 GB
Species: Homo sapiens
#Proteins: 4128
#Peptides: 60544
#Spectra: 184209
#Unidentified spectra: 123665
PTMs: 3
% delta m/z outliers: 0.0
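One way to make such a summary actionable is to hold each row in a small record and derive flags from it. The sketch below is hypothetical: the class `CurationSummary`, its field names, the reading of the PTMs column as a count, and the 5% outlier threshold are all assumptions; only the numbers in the usage lines come from the slide.

```python
from dataclasses import dataclass


@dataclass
class CurationSummary:
    """One row of a per-file curation summary (field names are assumed)."""
    filename: str
    species: str
    n_proteins: int
    n_peptides: int
    n_spectra: int
    n_unidentified_spectra: int
    n_ptms: int
    pct_delta_mz_outliers: float

    def pct_unidentified(self):
        """Share of spectra without an identification, in percent."""
        if self.n_spectra == 0:
            return 0.0
        return 100.0 * self.n_unidentified_spectra / self.n_spectra

    def flags(self, max_outlier_pct=5.0):
        """Return issues worth a curator's attention.

        The 5% default threshold is illustrative only.
        """
        issues = []
        if self.n_proteins == 0 or self.n_peptides == 0:
            issues.append("core data missing")
        if not self.species:
            issues.append("metadata missing: species")
        if self.pct_delta_mz_outliers > max_outlier_pct:
            issues.append("many delta m/z outliers: check modifications")
        return issues


# Values taken from the example row on the slide.
row = CurationSummary("22143.xml", "Homo sapiens", 4128, 60544,
                      184209, 123665, 3, 0.0)
print(round(row.pct_unidentified(), 1), row.flags())
```

An empty list of flags would mean that none of the checks sketched here fired for that file.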
Conclusion
growing amount of data
increasingly complex data
scalability issues
overcoming them by automation and new, smarter curation strategies
Thanks for the attention!