PRIDE: Quality control in a proteomics data repository. Attila Csordas, Proteomics Services Team. Biocuration Conference, April 2nd, 2012.


DESCRIPTION

The ppt version of a talk I gave at the Biocuration 2012 Conference in Washington DC at Georgetown University in front of ~300 people.

Page 1: Pride quality controlattilacsordasbiocuration2012

PRIDE: Quality control in a proteomics data repository

Attila Csordas

Proteomics Services Team

Biocuration Conference

April 2nd, 2012

1/23

Page 2

Overview

who are we?

what are we dealing with?

manual curation and submission

quick detour: ProteomeXchange

automated curation & submission pipeline

conclusion

Page 3

The PRoteomics IDEntifications (PRIDE) database is a centralised, primary, archival, public data repository for MS/MS proteomics data, containing peptide identifications, protein identifications, mass spectra, protein expression values, and metadata.

PRIDE: http://www.ebi.ac.uk/pride

Page 4

Acknowledgements

@pride_ebi

[email protected]@ebi.ac.uk

http://code.google.com/p/pride-toolsuite/

http://code.google.com/p/pride-converter-2/

colleagues at the PRIDE team

Page 5

Mass spectrometry

analytical technique measuring the mass-to-charge (m/z) ratio of charged particles to determine masses of particles, composition of samples/molecules and chemical structures of molecules
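The m/z definition above can be made concrete with a small calculation. This is a minimal sketch, assuming positively charged ions formed by adding protons (the slide does not specify the ionisation mode); the function names are illustrative and not part of any PRIDE tool.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (rounded)


def mz_from_mass(neutral_mass: float, charge: int) -> float:
    """m/z of an ion carrying `charge` extra protons (assumed ionisation model)."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def mass_from_mz(mz: float, charge: int) -> float:
    """Inverse of mz_from_mass: recover the neutral mass from an observed m/z."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return mz * charge - charge * PROTON_MASS


if __name__ == "__main__":
    # The same 1000 Da molecule appears at different m/z values per charge state.
    for z in (1, 2, 3):
        print(z, round(mz_from_mass(1000.0, z), 4))
```

The instrument only observes m/z, so the mass can be recovered only once the charge state is known.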

Page 6

Shotgun/bottom-up proteomics

[Figure: workflow diagram. Labels: proteins, peptides, MS analysis, fragmentation, MS/MS analysis, sequence database, protocol]

Page 7

What is a PRIDE submission?

Page 8

Growth of core data types

[Figure: growth chart. Values shown: 130 million, 23 million, 4.6 million]

Page 9

Manual curation and submission process

Search engine + spectra

Mascot (.dat), X!Tandem (.xml) + MGF

PRIDE Converter

PRIDE XML

Page 10

PRIDE Inspector

more flexible than web interface

initial assessment on data quality

visualise/check data

summary charts

support for submitters & reviewers/editors

Page 11

Frequent Data Quality Issues

1. syntactic problems: <SearchEngine>PeptideShaker</SearchEngine><PeptideItem>

2a. core data missing: no protein/peptide identifications

2b. or metadata missing: no species

3. inconsistent/incorrect data: protein modifications
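The first two issue classes above lend themselves to automated checks. The following is a minimal sketch, assuming a heavily simplified XML layout: `PeptideItem` appears on the slide, but the `Experiment` root and the `Species` element are hypothetical stand-ins and do not reproduce the real PRIDE XML schema.

```python
import xml.etree.ElementTree as ET


def check_submission(xml_text: str) -> list[str]:
    """Return a list of quality issues found in a (simplified) submission file."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # 1. syntactic problems: the file is not well-formed XML.
        return [f"syntactic problem: {err}"]

    issues = []
    # 2a. core data missing: no peptide identifications at all.
    if root.find(".//PeptideItem") is None:
        issues.append("core data missing: no peptide identifications")
    # 2b. metadata missing: no species annotation (hypothetical element name).
    species = root.findtext(".//Species")
    if not (species and species.strip()):
        issues.append("metadata missing: no species")
    return issues


if __name__ == "__main__":
    ok = "<Experiment><Species>Homo sapiens</Species><PeptideItem/></Experiment>"
    print(check_submission(ok))
    print(check_submission("<Experiment></Experiment>"))
```

Checking for inconsistent or incorrect data (class 3) needs domain knowledge beyond well-formedness, which is why it is left out of this sketch.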

Page 12

Delta m/z of detected peptide precursors

experimental precursor ion m/z - theoretical precursor ion m/z

source of delta m/z outliers: incorrect or missing protein modifications and charge state misassignments
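The delta m/z definition above can be sketched in a few lines. The 0.5 tolerance used to call a value an outlier is an assumption chosen for illustration; the slides do not state the threshold used in PRIDE curation.

```python
def delta_mz(experimental_mz: float, theoretical_mz: float) -> float:
    """Delta m/z as defined on the slide: experimental minus theoretical."""
    return experimental_mz - theoretical_mz


def outlier_percentage(pairs, tolerance: float = 0.5) -> float:
    """Percentage of (experimental, theoretical) pairs with |delta m/z| > tolerance.

    `tolerance` is a hypothetical cut-off, not a documented PRIDE value.
    """
    deltas = [delta_mz(exp, theo) for exp, theo in pairs]
    if not deltas:
        return 0.0
    outliers = sum(1 for d in deltas if abs(d) > tolerance)
    return 100.0 * outliers / len(deltas)


if __name__ == "__main__":
    precursors = [
        (500.26, 500.25),
        (612.84, 612.83),
        (708.33, 700.33),  # large delta: candidate for a missing modification
        (450.20, 450.21),
    ]
    print(outlier_percentage(precursors))
```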

Page 13

Fixing modifications based on delta m/z outliers

Page 14

Fixing modifications based on delta m/z outliers
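One way to act on a delta m/z outlier is to convert it to a mass difference (delta m/z times charge) and compare that against the mass shifts of common modifications. This is a sketch of that idea, not the procedure shown on the slides: the short modification table, the 0.02 Da tolerance, and the function name are assumptions for illustration.

```python
# Monoisotopic mass shifts (Da) of a few common modifications.
KNOWN_MODS = {
    "Oxidation": 15.994915,
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Phospho": 79.966331,
}


def suggest_modification(delta_mz: float, charge: int, tolerance: float = 0.02):
    """Suggest a modification whose mass shift explains a delta m/z outlier.

    Returns the closest match within `tolerance` Da (hypothetical value), or None.
    """
    delta_mass = delta_mz * charge  # undo the division by charge
    best_name, best_error = None, None
    for name, shift in KNOWN_MODS.items():
        error = abs(delta_mass - shift)
        if error <= tolerance and (best_error is None or error < best_error):
            best_name, best_error = name, error
    return best_name


if __name__ == "__main__":
    # A doubly charged precursor that is about 7.9975 m/z units too heavy.
    print(suggest_modification(7.9975, 2))
```

A positive delta (experimental heavier than theoretical) points to a modification missing from the annotation. A mismatch that fits no known shift may instead indicate a charge state misassignment.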

Page 15

but the manual approach does not scale!

Page 16

10 times as many (and as big) submissions per day?

Page 17

single point of submission of data to the main repositories to encourage data exchange

[Diagram labels: individual submissions; large-scale submissions; EBI PRIDE; raw files archive; PeptideAtlas; users; published, raw, reprocessed; UniProt; other DBs (GPMDB, …)]

Page 18

PX submission pipeline

[Diagram labels: PX Tool, Validation, Submission, Publication, ProteomeCentral; files: raw files, PRIDE XML, summary]

Page 19

Automated regular submission pipeline

curation-submission time is ~1/6th of manual time

actionable curation summary

Filename: 22143.xml
Size: 3.3 GB
Species: Homo sapiens
#Proteins: 4128
#Peptides: 60544
#Spectra: 184209
#Unidentified spectra: 123665
PTMs: 3
% delta m/z outlier: 0.0

Number of files: 3
Project: Combined personal saliva proteome and microbioproteome
XML generator software: PRIDE Converter Toolsuite 2.0-SNAPSHOT
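A curation summary like the one above becomes actionable once each row is turned into explicit flags. The sketch below assumes a simple record type mirroring the summary columns; the class name, the flag wording, and the 5% outlier threshold are hypothetical and are not taken from the PRIDE pipeline.

```python
from dataclasses import dataclass


@dataclass
class CurationSummary:
    filename: str
    species: str
    proteins: int
    peptides: int
    spectra: int
    unidentified_spectra: int
    ptms: int
    delta_mz_outlier_pct: float

    @property
    def unidentified_pct(self) -> float:
        """Share of spectra without an identification, in percent."""
        if self.spectra == 0:
            return 0.0
        return 100.0 * self.unidentified_spectra / self.spectra

    def flags(self, max_outlier_pct: float = 5.0) -> list[str]:
        """Issues a curator should look at (threshold is an assumed value)."""
        found = []
        if not self.species.strip():
            found.append("no species")
        if self.proteins == 0 or self.peptides == 0:
            found.append("no protein/peptide identifications")
        if self.delta_mz_outlier_pct > max_outlier_pct:
            found.append("many delta m/z outliers: check modifications")
        return found


if __name__ == "__main__":
    row = CurationSummary("22143.xml", "Homo sapiens", 4128, 60544,
                          184209, 123665, 3, 0.0)
    print(row.flags(), round(row.unidentified_pct, 1))
```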

Page 20

Conclusion

growing amount of data

scalability issues

overcoming them by automation and new, smarter curation strategies

increasingly complex data

Page 21

Page 22

Thanks for the attention!

Page 23

[email protected] @attilacsordas

Q&A
