Novel Peptide Identification using
ESTs and Genomic Sequence
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Novel Peptides
• Absent from traditional protein sequence databases
  • IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
• Due to
  • Deliberate “redundancy” elimination
  • “Dark-side” genes
  • Bias towards high-quality, high-confidence full-length protein sequence
What is missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
Why should we care?
• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
• Proteins have clinical implications
  • Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence
  • No hard evidence for translation start site
Novel Protein
HEQASNVLSDISEFR
Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas A8_IP (Resing et al.);
Novel Splice Isoform
LQGSATAAEAQVGHQTAR
Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas raftflow (von Haller, et al.)
• LIME1 gene
Novel Frame
TAGSPLCLPTPGAAPGSAGSCSHR
Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas raftflow (von Haller, et al.)
• LIME1 gene, downstream from LQGSA...
“Novel” Microexon
LQTASDESYKDPTNIQLSK
Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI
Details:
• Peptide Atlas raftflow (von Haller, et al.)
• SPTAN1 gene
Novel Mutation
KADDTWEPFASGK
Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion
Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.)
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.
Known Coding SNP
DTEEEDFHVDQ[V|A]TTVK
Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed
Details:
• HUPO PPP 40 (Wang; Omenn et al.)
• SERPINA1 gene
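The bracket notation above packs the wildtype and variant peptide into one string. A minimal sketch of expanding it into the individual sequences follows; the function name `expand_variants` is hypothetical, and the assumption that the first alternative inside the brackets is the wildtype residue comes only from the "Val12-to-Ala" annotation on the slide.

```python
import itertools
import re

def expand_variants(pattern):
    """Expand 'AB[C|D]E' into ['ABCE', 'ABDE'] (hypothetical helper)."""
    # re.split with a capture group keeps the bracket contents:
    # even indices are fixed text, odd indices are 'X|Y' alternatives.
    parts = re.split(r"\[([^\]]*)\]", pattern)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]

peptides = expand_variants("DTEEEDFHVDQ[V|A]TTVK")
for pep in peptides:
    print(pep)
```

Each expanded sequence can then be scored against the spectrum separately, which is how two E-values (one per form) can be reported for a single site.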
Wildtype
Known Coding SNP
LQHL[E|V]NELTHDIITK
Evidence:
• log10(E-value) = -6.7 / -10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed
Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.)
• SERPINA1 gene
IPI Common Variant Elimination
YYGGGYGSTQATFMVFQALAQYQK
Evidence:
• log10(E-value) = -5.9
• 100’s ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site
Details:
• HUPO PPP 29 (Qian/He; Omenn et al.)
• C3 gene
Why don’t we see more novel peptides?
• Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
Why don’t we see more novel peptides?
• Traditional protein sequence databases
  • High-quality, full-length proteins only
  • Many interesting peptides are omitted
  • Exclusive: peptide identifications are lost
• ESTs, genomic & mRNA sequence
  • Used as evidence for full-length protein sequences
  • Inclusive: may need to filter results
Significant False Positives
• E-values are not enough!
  • Random guessers are easy to beat.
• Post-translational modifications vs. amino-acid substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14 Da
  • D → E, G → A, V → I/L, N → Q, S → T: +14 Da
• Peptide extension z=+2 → z=+3
  • Nonsense AA masses sum to precursor
• Need to ensure:
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated
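The +14 coincidence above can be checked directly. The sketch below uses standard monoisotopic residue masses (rounded to five decimals) and searches for every single-residue substitution whose mass difference equals one methylation (a net CH2, 14.01565 Da). The function name and the 0.001 Da tolerance are illustrative assumptions, not part of the original talk.

```python
# Monoisotopic amino-acid residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

METHYL_SHIFT = 14.01565  # mass added by one methylation (net CH2), in Da

def substitutions_matching(shift, tol=0.001):
    """Return sorted (from, to) residue pairs whose mass difference is `shift`."""
    pairs = []
    for a, mass_a in RESIDUE_MASS.items():
        for b, mass_b in RESIDUE_MASS.items():
            if a != b and abs((mass_b - mass_a) - shift) <= tol:
                pairs.append((a, b))
    return sorted(pairs)

pairs = substitutions_matching(METHYL_SHIFT)
for a, b in pairs:
    print(f"{a} -> {b}: +{RESIDUE_MASS[b] - RESIDUE_MASS[a]:.4f} Da")
```

With this tolerance the search returns exactly the substitutions listed on the slide, so precursor mass alone cannot separate a methylated known peptide from a substituted "novel" one; the fragment ions have to do that.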
Significant False Positives
• DFLAGGLAAAISK 2.2 × 10^-8
  • 2 ESTs
• DFLAGGIAAAISK 2.2 × 10^-8
  • IPI (2), RefSeq, mRNA, ~1400 ESTs
• DFLAGGVAAAISK 3.7 × 10^-8
  • IPI, RefSeq, mRNA, ~700 ESTs
• DFLAGGVAAAISKMAVVPI 3.5 × 10^-5
  • Genscan exon
• AISFAKDFLAGGIAAAISK 3.3 × 10^-4
  • Genscan exon
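The first two candidates differ only by L versus I, which have identical mass. A small sketch of flagging such pairs by collapsing I to L before comparison is given below; the peptides and scores are copied from the list above, while `isobaric_key` and the grouping step are hypothetical helpers added for illustration.

```python
from collections import defaultdict

# Candidate peptides and their scores, copied from the slide.
candidates = {
    "DFLAGGLAAAISK": 2.2e-8,
    "DFLAGGIAAAISK": 2.2e-8,
    "DFLAGGVAAAISK": 3.7e-8,
    "DFLAGGVAAAISKMAVVPI": 3.5e-5,
    "AISFAKDFLAGGIAAAISK": 3.3e-4,
}

def isobaric_key(peptide):
    """Collapse I to L, since the two residues have the same mass."""
    return peptide.replace("I", "L")

groups = defaultdict(list)
for peptide in candidates:
    groups[isobaric_key(peptide)].append(peptide)

# Groups with more than one member cannot be separated by mass.
ambiguous = [sorted(members) for members in groups.values() if len(members) > 1]
print(ambiguous)
```

When a group like this appears, the sequence evidence decides: the I form is backed by IPI, RefSeq, mRNA and about 1400 ESTs, the L form by 2 ESTs.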
How do we know they are novel?
• How do we know they are real?
  • Good spectra
  • Good E-value
  • Good ion ladders
  • Good sequence evidence
• Lack of other explanations...
Peptide Sequence Evidence
• C3 Compression:
  • Amino-acid 30-mers
  • Complete, Correct(, Compact)
  • Present at least twice (ESTs only)

Gb of Sequence                Naïve   Enumeration     C3
Self Corrected ESTs (1)        7.60          2.90   0.18
Genome Corrected ESTs (2)      5.40          1.30   0.12
Corrected ESTs (1+2)          13.00          4.20   0.20
Genscan Exons (3)              0.10          0.06   0.05
Genscan Exon Pairs (4)         2.30          1.60   0.55
Combo (1+2+3+4)               28.40         10.06   0.78
Genomic ORFs (5)               6.20          1.90   1.50
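The "present at least twice" filter on 30-mers can be sketched as follows. This is not the C3 algorithm itself, only the enumeration and support filter that precede it, and it assumes "twice" means "in at least two distinct input sequences". The names `kmers` and `supported_kmers` are hypothetical, and the toy run uses k=4 so the output stays readable.

```python
from collections import Counter

def kmers(sequence, k):
    """Set of distinct k-mers in one amino-acid sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def supported_kmers(sequences, k=30, min_support=2):
    """Keep k-mers found in at least `min_support` distinct sequences."""
    support = Counter()
    for sequence in sequences:
        # kmers() returns a set, so each sequence counts a k-mer once.
        support.update(kmers(sequence, k))
    return {kmer for kmer, n in support.items() if n >= min_support}

toy = ["ACDEFGIK", "CDEFGIKL", "ACDEFACG"]
kept = supported_kmers(toy, k=4)
print(sorted(kept))
```

A k-mer seen in only one EST, such as one produced by a sequencing error, drops out, which is one plausible reading of why the filter is applied to ESTs only.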
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
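As a rough illustration, an SBH-graph for these three sequences can be built by treating every k-mer as an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The value k=4 is an assumption chosen for the toy example (the slide does not state k), and `sbh_graph` is a hypothetical name.

```python
from collections import Counter

def sbh_graph(sequences, k):
    """Map (prefix, suffix) edges to the number of times the k-mer occurs."""
    edges = Counter()
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # Nodes are (k-1)-mers; the k-mer itself is the edge.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

edges = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
for (u, v), count in sorted(edges.items()):
    print(f"{u} -> {v}  (seen {count}x)")
```

Every input sequence is a path in this graph, and shared k-mers are stored once with a count, which is where the space saving comes from.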
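Compressed-SBH-graph
[Figure: compressed SBH-graph of the example sequences; the node label ACDEFGI and the edge counts 2, 2, 1, 2, 1 survive, but the layout is not recoverable.]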
Peptide Sequence Databases
• MS/MS search engine input only
  • Protein context is lost
  • Inclusive, rather than exclusive
  • Download from http://www.umiacs.umd.edu/~nedwards
• Exact string search for gene/protein context
  • Recover peptide sequence evidence
• Relational database to reassemble...
  ...with respect to genes & genome
• Grid Computing + Web Services + Viewer
  • Work in progress
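Recovering protein context by exact string search could look like the sketch below. The protein identifiers and flanking residues in `toy_proteins` are made up for illustration (only the embedded peptides come from the slides), and `find_contexts` is a hypothetical helper, not the tool described in the talk.

```python
def find_contexts(peptide, proteins):
    """Return (protein_id, 1-based start) for every exact occurrence."""
    hits = []
    for protein_id, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((protein_id, start + 1))
            # Continue one residue later to catch overlapping occurrences.
            start = sequence.find(peptide, start + 1)
    return hits

# Hypothetical sequences: slide peptides embedded in invented flanks.
toy_proteins = {
    "toy_A": "MAAAKADDTWEPFASGKTT",
    "toy_B": "MKLQGSATAAEAQVGHQTARPP",
}

hits = find_contexts("KADDTWEPFASGK", toy_proteins)
print(hits)
```

For databases at the gigabase scale shown earlier, a linear scan like this would be replaced by an index, but the lookup result has the same shape.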
Peptide Identification Navigator
Conclusions
• Peptides identify more than proteins
• Search EST sequences (at least)
• Compressed peptide sequence databases make this feasible