Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Why Tandem Mass Spectrometry?

LC-MS/MS spectra provide evidence for the amino-acid sequence and abundance of functional proteins.

Key concepts:
• Spectrum acquisition is unbiased by knowledge
• Direct observation of amino-acid sequence
• Sensitive to small sequence variations
• Spectrum acquisition is biased by abundance

Sample Preparation for MS/MS

Enzymatic digest and fractionation

Single Stage MS

[Figure: MS]

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

[Figure: MS/MS]

Peptide Fragmentation

Peptide: S-G-F-L-E-E-D-E-L-K

MW     ion   Fragments          ion   MW
88     b1    S | GFLEEDELK      y9    1080
145    b2    SG | FLEEDELK      y8    1022
292    b3    SGF | LEEDELK      y7    875
405    b4    SGFL | EEDELK      y6    762
534    b5    SGFLE | EDELK      y5    633
663    b6    SGFLEE | DELK      y4    504
778    b7    SGFLEED | ELK      y3    389
907    b8    SGFLEEDE | LK      y2    260
1020   b9    SGFLEEDEL | K      y1    147
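The b- and y-ion masses for SGFLEEDELK can be recomputed from residue masses. The following is a minimal sketch, assuming singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` is illustrative and does not come from the talk.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments.

    b_ions[i] is b(i+1), the N-terminal fragment of length i+1;
    y_ions[i] is y(i+1), the C-terminal fragment of length i+1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_ions("SGFLEEDELK")
for i, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
    print(f"b{i}: {b_mz:8.2f}   y{i}: {y_mz:8.2f}")
```

Rounded to whole numbers, these values agree with the slide's masses to within 1 Da.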

Unannotated Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications


Translation start-site correction

• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
• Multiple significant peptide identifications
• Observed start is consistent with Glimmer 3.0 prediction(s)

Halobacterium sp. NRC-1 ORF: GdhA1

• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: axis from 0 to 440 in steps of 40]

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

Peptide Sequence Databases

• All amino-acid 30-mers, no redundancy
  • From ESTs, Proteins, mRNAs
• 30-40 fold size and search time reduction
• Formatted as a FASTA sequence database
  • One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490

Combine search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

[Figure: three-way overlap of X! Tandem, SEQUEST, and Mascot results; region labels 38%, 14%, 28%, 14%, 3%, 2%, 1%. Searle et al. JPR 7(1), 2008]

Combining search engine results – harder than it looks!

• Consensus boosts confidence, but...
  • How to assess statistical significance?
  • Gain specificity, but lose sensitivity!
  • Incorrect identifications are correlated too!
• How to handle weak identifications?
  • Consensus vs disagreement vs abstention
  • Threshold at some significance?
• We apply "unsupervised" machine-learning....
  • Lots of related work unified in a single framework.

[Figure: Search Engine Info. Gain]

PepArML Workflow

• Select high-quality IDs
  • Guess true proteins from search results
• Label spectra & train
• Calibrate confidence
  • Guess true proteins from ML results
• Iterate!
• Estimate FDR using (external) decoy

[Figure: workflow flowchart. Boxes: Spectra; Mascot, OMSSA, Tandem, ...; Extract Peptides & Features; Select High-Quality IDs (D0); Select "True" Proteins; Assign Training Labels; Train Classifier & Predict Correct IDs; Stable? (No / Yes); Recalibrate Confidence as FDR (D1); Select "True" Proteins; Output Peptide Spectrum Assignments]
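The decoy-based FDR estimate in the last step can be sketched as follows. This assumes each peptide-spectrum match carries a score (higher is better) and a flag marking decoy hits, and it uses the simple target-decoy ratio; the function names and the toy scores are hypothetical, not PepArML's actual interface.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR among target PSMs scoring at or above threshold.

    psms: list of (score, is_decoy) pairs; higher score = more confident.
    Uses the simple target-decoy estimate: (# decoy hits) / (# target hits).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr."""
    best = None
    for t in sorted({score for score, _ in psms}, reverse=True):
        if decoy_fdr(psms, t) <= max_fdr:
            best = t
    return best


# Toy data: (score, is_decoy)
psms = [(9.1, False), (8.7, False), (8.2, False), (7.9, True),
        (7.5, False), (7.0, False), (6.4, True), (6.0, False)]
print(decoy_fdr(psms, 7.0))          # 1 decoy / 5 targets
print(threshold_for_fdr(psms, 0.2))
```

Because the estimated FDR is not monotone in the threshold, the sketch scans every observed score and keeps the lowest one that still satisfies the bound.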

False-Discovery-Rate Curves

PepArML Meta-Search Engine

• Single, simple search request
• Secure communication
• Heterogeneous compute resources
• Scales easily to 250+ simultaneous searches

[Figure: architecture diagram. Compute resources: NSF TeraGrid (1000+ CPUs); Edwards Lab scheduler & 80+ CPUs; Amazon AWS. Search engine lists: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); and X!Tandem, KScore, OMSSA, MyriMatch.]

PeptideMapper Web Service

I’m Feeling Lucky


• Suffix-tree index on peptide sequence database
  • Fast peptide to gene/cluster mapping
  • "Compression" makes this feasible
• Peptide alignment with cluster evidence
  • Amino-acid or nucleotide; exact & near-exact
• Genomic-loci mapping via
  • UCSC "known-gene" transcripts, and
  • Predetermined, embedded genomic coordinates
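Exact peptide-to-gene/cluster lookup against an indexed sequence database can be illustrated with a small sketch. It swaps in a sorted suffix array with binary search in place of the suffix tree named above, purely for brevity; the function names and the toy sequences are hypothetical and not part of the PeptideMapper service.

```python
from bisect import bisect_right


def build_index(entries):
    """Concatenate entries and sort all suffix start positions.

    entries: dict mapping a gene/cluster name to its amino-acid sequence.
    """
    text, starts, names = "", [], []
    for name, seq in entries.items():
        starts.append(len(text))
        names.append(name)
        text += seq + "$"          # "$" separates entries
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffixes, starts, names


def _bound(text, suffixes, peptide, upper):
    """Binary search: first suffix whose prefix is >= peptide (> if upper)."""
    lo, hi, k = 0, len(suffixes), len(peptide)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[suffixes[mid]:suffixes[mid] + k]
        if prefix < peptide or (upper and prefix == peptide):
            lo = mid + 1
        else:
            hi = mid
    return lo


def map_peptide(peptide, index):
    """Return the sorted names of entries that contain peptide exactly."""
    text, suffixes, starts, names = index
    first = _bound(text, suffixes, peptide, upper=False)
    last = _bound(text, suffixes, peptide, upper=True)
    hits = {names[bisect_right(starts, pos) - 1] for pos in suffixes[first:last]}
    return sorted(hits)


index = build_index({
    "geneA": "MKSGFLEEDELKR",
    "geneB": "AAASGFLEEDELKTT",
    "geneC": "MMMM",
})
print(map_peptide("SGFLEEDELK", index))
```

Sorting whole suffixes as done here is only practical for toy inputs; a production index would use a linear-time construction and the compressed database described on the slide.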

Systems Biology

[Figure labels: molecular biology ↕ phenotype; molecular biology ↕ biology]

• Knowledge Databases: Localization, Function, Process, Interactions, Pathway, Mutation
• Structured High-Throughput Experiments: Proteomics, Sequencing, Microarrays, Metabolomics
• Mathematical Models
• Functional Annotation Enrichment

Why not in proteomics?

• Double counting and false positives…
  • …due to traditional protein inference
• Proteomics cannot see all proteins…
  • …proteins are not equally likely to be drawn
• Good relative abundance is hard…
  • …extra chemistries, workflows, and software
  • …missing values are particularly problematic

In proteomics…

• Double counting and false positives…
  • Use generalized protein parsimony
• Proteomics cannot see all proteins…
  • Use identified proteins as background
• Good relative abundance is hard…
  • Model differential spectral counts directly

Traditional Protein Parsimony

Select the smallest set of proteins that explain all identified peptides.

• Sensible principle, implies
  • Eliminate equivalent/subset proteins
• Equivalent proteins are problematic:
  • Which one to choose?
• Unique-protein peptides force the inclusion of proteins into solution
  • True for most tools, even probability based ones
  • Bad consequences for FDR filtered ids
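Eliminating equivalent and subset proteins can be sketched as follows. This assumes each protein is represented by the set of identified peptides it contains; the function name and the toy data are hypothetical.

```python
def eliminate_redundant(prot2peps):
    """Collapse equivalent proteins and drop proteins dominated by another.

    prot2peps: dict mapping protein name -> set of identified peptides.
    Returns a sorted list of groups; each group is a sorted list of
    equivalent proteins sharing one non-dominated peptide set.
    """
    # Group proteins with identical peptide sets (equivalent proteins).
    groups = {}
    for prot, peps in prot2peps.items():
        groups.setdefault(frozenset(peps), []).append(prot)
    # Keep a group only if no other group's peptide set strictly contains it.
    kept = [
        sorted(prots)
        for peps, prots in groups.items()
        if not any(peps < other for other in groups)
    ]
    return sorted(kept)


toy = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b", "c"},   # equivalent to P1
    "P3": {"a", "b"},        # subset of P1/P2
    "P4": {"c", "d"},
}
print(eliminate_redundant(toy))
```

The sketch reports equivalent proteins as a group instead of picking one member, which sidesteps the "which one to choose?" question without answering it.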

Peptide-Spectrum Matches

• Sigma49: 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human
• Yeast: 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD.
• X!Tandem E-value (no refinement), 1% FDR

Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L. 2007.

Many proteins are easy

• Eliminate equivalent / dominated proteins
  • Sigma49: 277 → 60 proteins
  • Yeast: 1226 → 1085 proteins
• Many components have a single protein:
  • Sigma49: 52 (3 multi-protein)
  • Yeast: 994 (43 multi-protein)
• Single peptides force protein inclusion
  • Sigma49: 16 single-peptide proteins
  • Yeast: 476 single-peptide proteins
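Counting single-protein and multi-protein components can be sketched with a union-find pass. This assumes "component" means a connected component of the protein-peptide graph, where two proteins are linked if they share a peptide; the function name and toy data are hypothetical.

```python
def components(prot2peps):
    """Group proteins into connected components linked by shared peptides."""
    parent = {p: p for p in prot2peps}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    owner = {}  # peptide -> first protein seen containing it
    for prot, peps in prot2peps.items():
        for pep in peps:
            if pep in owner:
                parent[find(prot)] = find(owner[pep])
            else:
                owner[pep] = prot

    groups = {}
    for prot in prot2peps:
        groups.setdefault(find(prot), []).append(prot)
    return sorted(sorted(g) for g in groups.values())


toy = {
    "P1": {"a", "b"},
    "P2": {"b", "c"},
    "P3": {"d"},
    "P4": {"e", "f"},
    "P5": {"f"},
}
comps = components(toy)
single = sum(1 for c in comps if len(c) == 1)
print(comps, "single-protein components:", single)
```

Single-protein components need no further decision; only the multi-protein ones require a parsimony choice.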

Must eliminate redundancy

Contained proteins should not be selected

[Figure: protein-by-peptide incidence matrix (X = peptide observed) for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112; 37 distinct peptides]

Must eliminate redundancy

• Contained proteins should not be selected
  • Even if they have some probability mass
  • Number of sibling peptides matter less if they are shared.

[Figure: incidence matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112, annotated "Single AA Difference". Two columns of per-protein values, in protein order: 1.0, 1.0, 0.8, 0.7, 0.0, 1.0 and 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]

Must ignore some PSMs

• A single additional peptide should not force protein into solution

[Figure: incidence matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112, annotated "Single AA Difference"]

Example from Yeast

• "Inosine monophosphate dehydrogenase" 4 gene family
• Contained proteins should not be selected
• Single peptide evidence for YML056C

[Figure: incidence matrix for YLR432W, YHR216W, YAR073W, and YML056C, with per-protein values 1.0, 0.6, 0.0, 1.0]

Must ignore some PSMs

• Improving peptide identification sensitivity makes things worse!
  • False PSMs don't cluster

[Figure labels: PSMs; Proteins; 10%; 2x]

[Figure labels: Select Proteins to Explain True PSM%; PSMs; 90%; 90%]

Must ignore some PSMs

• How do we choose?
  • Maximize # peptides?
  • Minimize FDR (naïve model)?
  • Maximize # PSMs?

[Figure: incidence matrices for the six IPI proteins (IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, IPI00903112) and the four yeast proteins (YLR432W, YHR216W, YAR073W, YML056C)]

Generalized Protein Parsimony

• Weight peptides by number of PSMs
• Constrain unique peptides per protein
• Maximize explained peptides (PSMs)
  • Match PSM filtering FDR to % uncovered PSMs
• Readily solved by branch-and-bound
  • Permits complex protein/peptide constraints
• Reduces to traditional protein parsimony
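One possible reading of these constraints can be sketched in code: pick the fewest proteins such that each selected protein has at least `min_unique` peptides not shared with any other selected protein, and the PSM-weighted fraction of unexplained peptides stays within `max_uncovered`. This formulation is an assumption made for illustration, and the sketch uses exhaustive enumeration by subset size in place of branch-and-bound, so it only suits small components. All names and the toy data are hypothetical.

```python
from itertools import combinations


def generalized_parsimony(prot2peps, psm_counts, min_unique=1, max_uncovered=0.0):
    """Smallest protein set leaving at most max_uncovered of the PSMs unexplained.

    prot2peps:  dict protein -> set of peptides
    psm_counts: dict peptide -> number of PSMs (the peptide's weight)
    min_unique: peptides a selected protein must have that no other
                selected protein shares
    Returns (proteins, covered_fraction), or None if no selection qualifies.
    """
    total = sum(psm_counts.values())
    proteins = sorted(prot2peps)
    for size in range(1, len(proteins) + 1):
        best = None
        for subset in combinations(proteins, size):
            # Enforce the unique-peptide constraint within the selection.
            valid = True
            for prot in subset:
                others = set().union(*(prot2peps[q] for q in subset if q != prot))
                if len(prot2peps[prot] - others) < min_unique:
                    valid = False
                    break
            if not valid:
                continue
            covered = set().union(*(prot2peps[p] for p in subset))
            weight = sum(psm_counts[pep] for pep in covered)
            if best is None or weight > best[0]:
                best = (weight, subset)
        if best is not None and total - best[0] <= max_uncovered * total:
            return list(best[1]), best[0] / total
    return None


toy_proteins = {
    "P1": {"a", "b", "c", "d"},
    "P2": {"a", "b"},          # contained in P1
    "P3": {"d", "e"},          # only one peptide ("e") beyond P1
    "P4": {"f", "g", "h"},
}
toy_psms = {"a": 10, "b": 8, "c": 5, "d": 4, "e": 1, "f": 6, "g": 3, "h": 3}

traditional = generalized_parsimony(toy_proteins, toy_psms)
relaxed = generalized_parsimony(toy_proteins, toy_psms, min_unique=2, max_uncovered=0.10)
print(traditional)
print(relaxed)
```

With `min_unique=1` and `max_uncovered=0.0` the search must cover every peptide with the fewest proteins, which is the traditional parsimony problem. In the toy data the single PSM for peptide "e" forces P3 into the traditional solution, while the relaxed setting leaves that PSM unexplained and selects only P1 and P4.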

Match uncovered PSMs to FDR

Plasma membrane enrichment

• Pellicle enrichment of plasma membrane
  • Choksawangkarn et al. JPR 2013 (Fenselau Lab)
• Six replicate LC-MS/MS analyses each
  • Cell-lysate (44,861 MS/MS)
  • Fe3O4-Al2O3 pellicle (21,871 MS/MS)
• 625 3-unique proteins to match 10% FDR:
  • Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
  • 89 proteins with significantly (p < 10^-5) increased counts

Semi-quantitative LC-MS/MS

[Figure: precursor selection + collision-induced dissociation (CID); MS/MS]

[Figure: Chen and Yates. Molecular Oncology, 2007]

Plasma membrane enrichment

• Na+/K+ ATPase subunit alpha-1 (P05023):
  • Lysate: 1; Pellicle: 90; p-value: 5.2 x 10^-33
• Transferrin receptor protein 1 (P02786):
  • Lysate: 17; Pellicle: 63; p-value: 2.0 x 10^-11
• DAVID Bioinformatics analysis (89/625):
  • Plasma membrane (GO:0005886): 29 (5.2 x 10^-5)
  • Transmembrane (SwissProtKW): 24 (1.3 x 10^-6)
• Transmembrane (SwissProtKW):
  • Lysate: 524; Pellicle: 1335; p-value: 2.6 x 10^-158
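A differential spectral-count p-value of this kind can be sketched with a one-sided hypergeometric tail. The sketch assumes the background is the total PSM count per condition (18,976 lysate and 13,723 pellicle) and tests whether a protein's pellicle count is higher than expected by chance; whether the talk used exactly this model and these totals is an assumption, and the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n PSMs are drawn without replacement from N total,
    of which K belong to the condition of interest."""
    total = comb(N, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return hits / total


lysate_total, pellicle_total = 18_976, 13_723   # PSMs per condition
lysate, pellicle = 1, 90                        # counts for P05023

p_value = hypergeom_upper_tail(
    k=pellicle,
    n=lysate + pellicle,
    K=pellicle_total,
    N=lysate_total + pellicle_total,
)
print(f"p-value = {p_value:.1e}")
```

Exact integer binomials avoid floating-point underflow for p-values this small; Python only converts to a float in the final division.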

Distribution of p-values (Yeast)

A protein's PSMs rise and fall together!

A protein's PSMs rise and fall together?

Anomalies indicate proteoforms

HER2/Neu Mouse Model of Breast Cancer

• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
  • 1.4 million MS/MS spectra
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
  • 2-unique generalized protein parsimony

Nascent polypeptide-associated complex subunit alpha

7.3 x 10^-8

Pyruvate kinase isozymes M1/M2

2.5 x 10^-5

Summary

Improve the scope and sensitivity of peptide identification for genome annotation, using:

• Exhaustive peptide sequence databases
• Machine-learning for combining meta-search tools to maximize consensus
• Grid-computing for thorough search

Summary

• Functional annotation enrichment for proteomics too:
  • Careful counting (generalized parsimony)
  • Differential abundance by spectral counts
• Use (multivariate-)hypergeometric model for
  • Differential abundance by spectral counts
  • Proteoform detection