Novel Peptide Identification using ESTs and Genomic Sequence

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Novel Peptides

• Absent from traditional protein sequence databases
  • IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB

• Due to
  • Deliberate “redundancy” elimination
  • “Dark-side” genes
  • Bias towards high-quality, high-confidence full-length protein sequence

What is missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins

• Proteins have clinical implications
  • Biomarker discovery

• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • No hard evidence for translation start site

Novel Protein

HEQASNVLSDISEFR

Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas A8_IP (Resing et al.)

Novel Protein

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000833.3.xml&uid=61235&label=AAAACWFI&homolog=AAAACWFI&id=1962.1.1&proex=-1

Novel Protein

http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=68062273&hgt.in1=1.5x&position=chrX%3A122566218-122566301

Novel Protein

http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=68062273&hgt.in1=1.5x&position=chrX%3A122562855-122562923

Novel Splice Isoform

LQGSATAAEAQVGHQTAR

Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.)
• LIME1 gene

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

Novel Frame

TAGSPLCLPTPGAAPGSAGSCSHR

Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence

Details:
• Peptide Atlas raftflow (von Haller, et al.)
• LIME1 gene, downstream from LQGSA...

“Novel” Microexon

LQTASDESYKDPTNIQLSK

Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI

Details:
• Peptide Atlas raftflow (von Haller, et al.)
• SPTAN1 gene

Novel Mutation

KADDTWEPFASGK

Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion

Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.)
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.

Novel Mutation

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300002887.18.xml&uid=202568&label=AAAKEPZA&homolog=AAAKEPZA&id=1838.1.1&proex=-1

Novel Mutation

http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=68063647&hgt.out1=1.5x&position=chr18%3A27426956-27426991

Novel Mutation

http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=68063647&hgt.right3=%3E%3E%3E&position=chr18%3A27428999-27429052

Known Coding SNP

DTEEEDFHVDQ[V|A]TTVK

Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed

Details:
• HUPO PPP 40 (Wang; Omenn et al.)
• SERPINA1 gene
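The bracket notation on this slide packs the wildtype and variant peptides into one string. A minimal sketch of expanding it into the individual sequences a search engine would need, assuming `[X|Y]` always means "one of these alternatives at this position" (the function name `expand_variants` is hypothetical, not from the talk):

```python
import itertools
import re


def expand_variants(pattern: str) -> list[str]:
    """Expand bracketed alternatives, e.g. 'AB[C|D]E' -> ['ABCE', 'ABDE']."""
    # re.split with a capture group alternates literal runs (even indices)
    # with the captured 'X|Y' alternatives (odd indices).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    options = [
        part.split("|") if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    # The Cartesian product handles patterns with more than one variant site.
    return ["".join(combo) for combo in itertools.product(*options)]


for peptide in expand_variants("DTEEEDFHVDQ[V|A]TTVK"):
    print(peptide)
```

Both expanded sequences can then be scored separately, which is how one would obtain the pair of E-values listed under Evidence.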

Wildtype

Known Coding SNP

LQHL[E|V]NELTHDIITK

Evidence:
• log10(E-value) = -6.7 / -10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed

Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.)
• SERPINA1 gene

IPI Common Variant Elimination

YYGGGYGSTQATFMVFQALAQYQK

Evidence:
• log10(E-value) = -5.9
• 100’s of ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site.

Details:
• HUPO PPP 29 (Qian/He; Omenn et al.)
• C3 gene

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

Why don’t we see more novel peptides?

• Traditional protein sequence databases
  • High-quality, full-length proteins only
  • Many interesting peptides are omitted
  • Exclusive – peptide identifications are lost.

• ESTs, genomic & mRNA sequence
  • Used as evidence for full-length protein sequences
  • Inclusive – may need to filter results

Significant False Positives

• E-values are not enough!
  • Random guessers are easy to beat.

• Post-translational modifications vs. amino-acid substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14
  • D → E, G → A, V → I/L, N → Q, S → T: +14

• Peptide extension z=+2 → z=+3
  • Nonsense AA masses sum to precursor

• Need to ensure:
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated
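The "+14" ambiguity can be checked with simple arithmetic. The sketch below uses standard monoisotopic residue masses (these values are not from the slides) and compares each listed substitution with the mass added by one methyl group; the names `RESIDUE_MASS`, `METHYL_SHIFT`, and `substitution_delta` are illustrative only.

```python
# Standard monoisotopic residue masses in Da (assumed values, not from the talk).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}

# Methylation replaces an H with CH3, a net gain of CH2.
METHYL_SHIFT = 14.01565

# The substitutions listed on the slide.
SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"), ("N", "Q"), ("S", "T")]


def substitution_delta(src: str, dst: str) -> float:
    """Mass change in Da when residue `src` is replaced by residue `dst`."""
    return RESIDUE_MASS[dst] - RESIDUE_MASS[src]


for src, dst in SUBSTITUTIONS:
    delta = substitution_delta(src, dst)
    print(f"{src} -> {dst}: {delta:+.5f} Da  (methylation: {METHYL_SHIFT:+.5f} Da)")
```

Because each of these pairs differs by exactly one CH2, precursor mass alone cannot separate a methylated known peptide from an unmodified substituted one; only fragment ions that localize the shift to a residue can.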

• DFLAGGLAAAISK 2.2 × 10^-8
  • 2 ESTs

• DFLAGGIAAAISK 2.2 × 10^-8
  • IPI (2), RefSeq, mRNA, ~1400 ESTs

• DFLAGGVAAAISK 3.7 × 10^-8
  • IPI, RefSeq, mRNA, ~700 ESTs

• DFLAGGVAAAISKMAVVPI 3.5 × 10^-5
  • Genscan exon

• AISFAKDFLAGGIAAAISK 3.3 × 10^-4
  • Genscan exon
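The candidates listed above illustrate the false-positive mechanisms from the previous slide. A rough sketch of the precursor arithmetic, assuming standard monoisotopic masses and unmodified peptides (the helper names `peptide_mass` and `precursor_mz` are hypothetical, and the charge states assigned to each candidate here are an assumption, not something stated in the talk):

```python
# Standard monoisotopic residue masses in Da (assumed values, not from the talk).
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "F": 147.06841, "G": 57.02146,
    "I": 113.08406, "K": 128.09496, "L": 113.08406, "M": 131.04049,
    "P": 97.05276, "S": 87.03203, "V": 99.06841,
}
WATER = 18.01056   # added once per peptide for the terminal H and OH
PROTON = 1.00728   # mass of each charge-carrying proton


def peptide_mass(seq: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def precursor_mz(seq: str, charge: int) -> float:
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(seq) + charge * PROTON) / charge


# Each pair: a well-supported peptide at z=+2 and an extended candidate at z=+3.
pairs = [
    ("DFLAGGVAAAISK", "DFLAGGVAAAISKMAVVPI"),
    ("DFLAGGIAAAISK", "AISFAKDFLAGGIAAAISK"),
]
for short, extended in pairs:
    print(f"{short} (2+): {precursor_mz(short, 2):.4f}")
    print(f"{extended} (3+): {precursor_mz(extended, 3):.4f}")
```

Under these assumptions the L and I forms have identical mass, the V form is one CH2 lighter, and each extended candidate at 3+ lands within about half an m/z unit of its shorter counterpart at 2+. That is consistent with the slide's warning that a spectrum with uncertain charge state can match an extended "novel" sequence by mass alone.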

How do we know they are novel?

• How do we know they are real?
  • Good spectra
  • Good E-value
  • Good ion ladders
  • Good sequence evidence

• Lack of other explanations...

Peptide Sequence Evidence

• C3 Compression:
  • Amino-acid 30-mers
  • Complete, Correct(, Compact)
  • Present at least twice (ESTs only)

                             Gb of Sequence   Naïve Enumeration     C3
Self Corrected ESTs (1)                7.60                2.90   0.18
Genome Corrected ESTs (2)              5.40                1.30   0.12
Corrected ESTs (1+2)                  13.00                4.20   0.20
Genscan Exons (3)                      0.10                0.06   0.05
Genscan Exon Pairs (4)                 2.30                1.60   0.55
Combo (1+2+3+4)                       28.40               10.06   0.78
Genomic ORFs (5)                       6.20                1.90   1.50
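The "present at least twice" filter can be sketched as a k-mer count across input sequences. This is a minimal illustration, not the C3 construction itself: it assumes "twice" means support from two different sequences, uses k=4 on toy strings instead of 30-mers, and the function name `repeated_kmers` is hypothetical.

```python
from collections import Counter


def repeated_kmers(sequences: list[str], k: int = 30, min_count: int = 2) -> set[str]:
    """Return the k-mers that occur in at least `min_count` of the sequences."""
    counts: Counter[str] = Counter()
    for seq in sequences:
        # A set per sequence, so a k-mer repeated inside one sequence
        # counts only once (assumption: support must be independent).
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy amino-acid strings; real input would be translated EST sequences.
toy = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(sorted(repeated_kmers(toy, k=4)))
```

Requiring repeated support discards k-mers that appear only once, which is one plausible way sequencing errors in single ESTs get filtered out before the compression step.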

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
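A minimal sketch of building an SBH-style graph from these three sequences, assuming the usual convention that each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The choice of k=4 and the function name `sbh_graph` are assumptions for illustration; the slide does not state the k used in the figure.

```python
from collections import Counter


def sbh_graph(sequences: list[str], k: int) -> Counter:
    """Map each (prefix, suffix) edge to the number of times its k-mer occurs."""
    edges: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Edge from the first k-1 letters to the last k-1 letters.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
for (src, dst), count in sorted(graph.items()):
    print(f"{src} -> {dst}  x{count}")
```

Shared k-mers collapse onto a single edge with a multiplicity, so every input sequence remains a path in the graph while repeated content is stored once.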

Compressed-SBH-graph

ACDEFGI

[Figure: compressed SBH-graph; labels shown in the figure: 2, 2, 1, 2, 1]

Peptide Sequence Databases

• MS/MS search engine input only
  • Protein context is lost
  • Inclusive, rather than exclusive
  • Download from http://www.umiacs.umd.edu/~nedwards

• Exact string search for gene/protein context
  • Recover peptide sequence evidence

• Relational database to reassemble...
  ...with respect to genes & genome

• Grid Computing + Web Services + Viewer
  • Work in progress
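Recovering protein context by exact string search can be sketched as follows. The protein names and sequences are made-up toy data, and `find_peptide_context` is a hypothetical helper, not part of any tool named in the talk.

```python
def find_peptide_context(peptide: str, proteins: dict[str, str], flank: int = 5):
    """Return (protein, start, left flank, right flank) for every exact match."""
    hits = []
    for name, seq in proteins.items():
        start = seq.find(peptide)
        while start != -1:
            end = start + len(peptide)
            left = seq[max(0, start - flank):start]
            right = seq[end:end + flank]
            hits.append((name, start, left, right))
            # Resume one position later so overlapping matches are found too.
            start = seq.find(peptide, start + 1)
    return hits


# Toy sequences for illustration only.
proteins = {
    "toy_protein_1": "MKTAYIAKDFLAGGIAAAISKTAVAPIERVK",
    "toy_protein_2": "MSEQNNTEMTFQIQR",
}
for hit in find_peptide_context("DFLAGGIAAAISK", proteins):
    print(hit)
```

The flanking residues are what a compressed peptide database discards; looking them up again after the search shows, for example, whether the identified peptide is preceded by a tryptic cleavage site.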

Peptide Identification Navigator

Conclusions

• Peptides identify more than proteins

• Search EST sequences (at least)

• Compressed peptide sequence databases make this feasible
