Novel Peptide Identification using

ESTs and Genomic Sequence

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Novel Peptides

• Absent from traditional protein sequence databases
  • IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
• Due to
  • Deliberate “redundancy” elimination
  • “Dark-side” genes
  • Bias towards high-quality, high-confidence full-length protein sequence

What is missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
• Proteins have clinical implications
  • Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • No hard evidence for translation start site

Novel Protein

HEQASNVLSDISEFR

Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas A8_IP (Resing et al.);

Novel Splice Isoform

LQGSATAAEAQVGHQTAR

Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene

Novel Frame

TAGSPLCLPTPGAAPGSAGSCSHR

Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene, downstream from LQGSA...

“Novel” Microexon

LQTASDESYKDPTNIQLSK

Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI

Details:
• Peptide Atlas raftflow (von Haller, et al.);
• SPTAN1 gene

Novel Mutation

KADDTWEPFASGK

Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion

Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.);
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.

Known Coding SNP

DTEEEDFHVDQ[V|A]TTVK

Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed

Details:
• HUPO PPP 40 (Wang; Omenn et al.);
• SERPINA1 gene

Known Coding SNP

LQHL[E|V]NELTHDIITK

Evidence:
• log10(E-value) = -6.7 / -10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed

Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.);
• SERPINA1 gene
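
The bracket notation used on the two coding-SNP slides can be unpacked mechanically. The sketch below is only an illustration: it assumes the first residue inside [X|Y] is the reference (wildtype) residue and the second is the variant, and the function name expand_variant is hypothetical, not part of any tool named in the talk.

```python
import re

def expand_variant(peptide):
    """Expand one '[X|Y]' site into (position, wildtype peptide, variant peptide).

    Assumption: X is the reference residue and Y the variant.
    The position is 1-based and counted within the peptide, not the protein.
    """
    match = re.search(r"\[([A-Z])\|([A-Z])\]", peptide)
    if match is None:
        raise ValueError("no [X|Y] variant site found")
    prefix, suffix = peptide[:match.start()], peptide[match.end():]
    position = len(prefix) + 1
    wildtype = prefix + match.group(1) + suffix
    variant = prefix + match.group(2) + suffix
    return position, wildtype, variant

for notation in ("DTEEEDFHVDQ[V|A]TTVK", "LQHL[E|V]NELTHDIITK"):
    print(expand_variant(notation))
```

For these two peptides the computed positions are 12 and 5, which matches the Val12-to-Ala and Glu5-to-Val labels on the slides.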

IPI Common Variant Elimination

YYGGGYGSTQATFMVFQALAQYQK

Evidence:
• log10(E-value) = -5.9
• 100’s ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site.

Details:
• HUPO PPP 29 (Qian/He; Omenn et al.);
• C3 gene

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

Why don’t we see more novel peptides?

• Traditional protein sequence databases
  • High-quality, full-length proteins only
  • Many interesting peptides are omitted
  • Exclusive – peptide identifications are lost.
• ESTs, genomic & mRNA sequence
  • Used as evidence for full-length protein sequences
  • Inclusive – may need to filter results

Significant False Positives

• E-values are not enough!
  • Random guessers are easy to beat.
• Post-translational modifications vs. amino-acid substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14
  • D → E, G → A, V → I/L, N → Q, S → T: +14
• Peptide extension z=+2 → z=+3
  • Nonsense AA masses sum to precursor
• Need to ensure:
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated
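
The "+14" ambiguity can be checked with simple arithmetic. The sketch below uses standard monoisotopic residue masses and the mass of one CH2 unit for methylation; the table and function names are illustrative only, and the 0.001 Da tolerance is an arbitrary choice for the comparison.

```python
# Standard monoisotopic residue masses in Da (only the residues needed here).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "Q": 128.05858, "D": 115.02694, "E": 129.04259,
}
METHYLATION_SHIFT = 14.01565  # one added CH2 unit, in Da

def substitution_shift(old, new):
    """Mass change in Da when residue `old` is replaced by residue `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]

# The substitutions listed on the slide.
PAIRS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"), ("N", "Q"), ("S", "T")]

for old, new in PAIRS:
    delta = substitution_shift(old, new)
    same = abs(delta - METHYLATION_SHIFT) < 0.001
    print(f"{old} -> {new}: {delta:+.5f} Da, indistinguishable from methylation: {same}")
```

Every pair differs by exactly one CH2, so precursor mass alone cannot separate a substitution from a methylated residue; only fragment ions that localize the shift can.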

Significant False Positives

• DFLAGGLAAAISK 2.2×10⁻⁸
  • 2 ESTs
• DFLAGGIAAAISK 2.2×10⁻⁸
  • IPI (2), RefSeq, mRNA, ~1400 ESTs
• DFLAGGVAAAISK 3.7×10⁻⁸
  • IPI, RefSeq, mRNA, ~700 ESTs
• DFLAGGVAAAISKMAVVPI 3.5×10⁻⁵
  • Genscan exon
• AISFAKDFLAGGIAAAISK 3.3×10⁻⁴
  • Genscan exon

How do we know they are novel?

• How do we know they are real?
  • Good spectra
  • Good E-value
  • Good ion ladders
  • Good sequence evidence

• Lack of other explanations...

Peptide Sequence Evidence

• C3 Compression:
  • Amino-acid 30-mers
  • Complete, Correct(, Compact)
  • Present at least twice (ESTs only)

Gb of                       Sequence   Naïve Enumeration     C3
Self Corrected ESTs (1)         7.60                2.90   0.18
Genome Corrected ESTs (2)       5.40                1.30   0.12
Corrected ESTs (1+2)           13.00                4.20   0.20
Genscan Exons (3)               0.10                0.06   0.05
Genscan Exon Pairs (4)          2.30                1.60   0.55
Combo (1+2+3+4)                28.40               10.06   0.78
Genomic ORFs (5)                6.20                1.90   1.50
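
The "present at least twice" filter on k-mers can be illustrated with a toy example. This is a sketch, not the C3 construction itself: it assumes "twice" means "in at least two distinct input sequences", uses k=4 instead of 30 so the output stays readable, and reuses the three example strings from the SBH-graph slide. The function name is hypothetical.

```python
from collections import defaultdict

def kmers_in_multiple_sequences(sequences, k, min_sources=2):
    """Return the k-mers that occur in at least `min_sources` distinct sequences."""
    sources = defaultdict(set)
    for index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            sources[seq[start:start + k]].add(index)
    return {kmer for kmer, seen in sources.items() if len(seen) >= min_sources}

toy_sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
supported = kmers_in_multiple_sequences(toy_sequences, k=4)
print(sorted(supported))  # ['ACDE', 'CDEF', 'DEFG', 'EFGI']
```

Under these assumptions, k-mers seen in only one sequence (for example DEFA or GEFG) are dropped, which is the kind of filtering that keeps unsupported EST fragments out of the search database.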

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
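
A graph of this kind can be sketched from the three sequences above. The code assumes the usual sequencing-by-hybridization construction, in which each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and it picks k=4 arbitrarily since the slide does not state k. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def sbh_edges(sequences, k):
    """Count every k-mer; each k-mer is an edge prefix(k-1) -> suffix(k-1)."""
    edges = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            edges[seq[start:start + k]] += 1
    return edges

def branching_nodes(edges):
    """Nodes with more than one distinct successor or predecessor."""
    successors, predecessors = defaultdict(set), defaultdict(set)
    for kmer in edges:
        src, dst = kmer[:-1], kmer[1:]
        successors[src].add(dst)
        predecessors[dst].add(src)
    nodes = set(successors) | set(predecessors)
    return {
        node for node in nodes
        if len(successors.get(node, ())) > 1 or len(predecessors.get(node, ())) > 1
    }

edges = sbh_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
for kmer, count in sorted(edges.items()):
    print(f"{kmer[:-1]} -> {kmer[1:]}  ({kmer}, seen {count}x)")
print("branching nodes:", sorted(branching_nodes(edges)))
```

Runs of edges between branching nodes are the natural candidates for merging into single labelled edges when the graph is compressed.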

Compressed-SBH-graph

[Figure: compressed SBH-graph. Surviving labels: sequence ACDEFGI; edge counts 2, 2, 1, 2, 1]

Peptide Sequence Databases

• MS/MS search engine input only
  • Protein context is lost
  • Inclusive, rather than exclusive
  • Download from http://www.umiacs.umd.edu/~nedwards
• Exact string search for gene/protein context
  • Recover peptide sequence evidence
• Relational database to reassemble...
  ...with respect to genes & genome
• Grid Computing + Web Services + Viewer
  • Work in progress
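
The "exact string search for gene/protein context" step amounts to locating each identified peptide in the source sequences. The sketch below shows the idea on made-up data: the sequence identifiers and toy sequences are hypothetical, and a real pipeline would search translated ESTs, mRNA and genomic sequence rather than a small dictionary.

```python
def find_peptide(peptide, sequences):
    """Return (sequence_id, 1-based start) for every exact occurrence of peptide."""
    hits = []
    for seq_id, seq in sequences.items():
        start = seq.find(peptide)
        while start != -1:
            hits.append((seq_id, start + 1))
            start = seq.find(peptide, start + 1)
    return hits

# Hypothetical toy sequences; only the embedded peptide comes from the slides.
toy_database = {
    "toy_est_1": "MK" + "LQGSATAAEAQVGHQTAR" + "GGSP",
    "toy_est_2": "MAVVPIGG",
}
print(find_peptide("LQGSATAAEAQVGHQTAR", toy_database))  # [('toy_est_1', 3)]
```

Because the peptide database discards protein context, a lookup of this kind is what recovers which ESTs or genes support an identification.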

Peptide Identification Navigator

Conclusions

• Peptides identify more than proteins

• Search EST sequences (at least)

• Compressed peptide sequence databases make this feasible
