Hidden unit weights in network model correlations
Compartments in the eukaryotic cell
Protein targeting/localization signals
• Signal peptide
• Mitochondrial targeting peptide
• Chloroplast targeting peptide
• LPxTG sorting signal
• Peroxisomal targeting signal (PTS2)
• Signal anchor
• Nuclear localization signal
• ER/Golgi retention signal
• Peroxisomal targeting signal (PTS1)
• Transmembrane helices
Cleaved
Uncleaved
Classical secretory pathway
The secretory signal peptide
Targeting to the ER
Eukaryotic signal peptide logo
Characteristics of signal peptides
        Length  n-region                h-region                           c-region              -3, -1
Euk     22      only slightly Arg-rich  short, very hydrophobic            short, no pattern     small and neutral residues
Gram-   25      Lys+Arg-rich            slightly longer, less hydrophobic  short, Ser+Ala-rich   almost exclusively Ala
Gram+   32      Lys+Arg-rich            very long, less hydrophobic        longer, Thr+Pro-rich  almost exclusively Ala
Prokaryotic signal peptide logos
Gram-positive bacteria
Gram-negative bacteria
Positive and negative training data: secreted versus cytoplasmic and nuclear sequences
 130 YGIW_ECOLI
MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPNGSVTTVESAKSLRDDTWVTLRGNIVERISDDLYVFKD 80
ASGTINVDIDHKRWNGVTVTPKDTVEIQGEVDKDWNSVEIDVKQIRKVNP 160
SSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM------------------------------- 160
 184 PMFA_PROMI
MKLSKIALAAALVFGINSVATAENETPAPKVSSTKGEIQLKGEIVNSACGLAASSSPVIVDFSEIPTSALANLQKAGNIK 80
KDIELQDCDTTVAKTATVSYTPSVVNAVNKDLASFVSGNASGAGIGLMDAGSKAVKWNTATTPVQLINGVSKIPFVAYVQ 160
AESADAKVTPGEFQAVINFQVDYQ 240
SSSSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM------------------------------------------------------------- 160
------------------------
 324 CYSB_KLEAE
MKLQQLRYIVEVVNHNLNVSSTAEGLYTSQPGISKQVRMLEDELGIQIFARSGKHLTQVTPAGQEIIRIAREVLSKVDAI 80
KSVAGEHTWPDKGSLYVATTHTQARYALPGVIKGFIERYPRVSLHMHQGSPTQIAEAVSKGNADFAIATEALHLYDDLVM 160
LPCYHWNRSIVVTPEHPLATKASVSIEELAQYPLVTYTFGFTGRSELDTAFNRAGLTPRIVFTATDADVIKTYVRLGLGV 240
GVIASMAVDPVSDPDLVKLDANGIFSHSTTKIGFRRSTFLRSYMYDFIQRFAPHLTRDVVDTAVALRSNEDIEAMFKDIK 320
LPEK 400
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM------------------------------------------------------------- 160
-------------------------------------------------------------------------------- 240
-------------------------------------------------------------------------------- 320
---- 400
 157 SBMC_ECOLI
MNYEIKQEEKRTVAGFHLVGPWEQTVKKGFEQLMMWVDSKNIVPKEWVAVYYDNPDETPAEKLRCDTVVTVPGYFTLPEN 80
SEGVILTEITGGQYAVAVARVVGDDFAKPWYQFFNSLLQDSAYEMLPKPCFEVYLNNGAEDGYWDIEMYVAVQPKHH 160
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM---------------------------------------------------------- 160
Data partitioning for training and test
Remove highly similar sequences from the data set, i.e. sequences for which cleavage site information can reliably be transferred by alignment.
A redundancy-reduced data set can then be used for, say, five-fold cross-validation.
Ideally, the training set contains equal numbers of positive and negative examples.
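The partitioning step can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for SignalP: it assumes redundancy reduction has already been done, and the function names and sequence identifiers are hypothetical.

```python
import random

def k_fold_partitions(items, k=5, seed=0):
    """Split items into k disjoint folds of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th element gives fold sizes that differ by at most one.
    return [shuffled[i::k] for i in range(k)]

def train_test_splits(items, k=5, seed=0):
    """Return (training, test) pairs; each fold is the test set exactly once."""
    folds = k_fold_partitions(items, k, seed)
    splits = []
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((training, test))
    return splits

# Hypothetical identifiers standing in for a redundancy-reduced data set.
sequence_ids = [f"seq{n:02d}" for n in range(23)]
splits = train_test_splits(sequence_ids)
```

Each sequence ends up in exactly one test set, so performance is always measured on sequences the network has not seen during training.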
Training
Test
Sliding window
Sequence: MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPNGSVTTVES ...
Window size here is 9 (example)
Window 1: MAKFAAVIA
Window 2: AKFAAVIAV
Window 3: KFAAVIAVM
Window 4: FAAVIAVMA
...
Window 10: VMALCSAPV
...
For signal peptide prediction, typically only the first 70 aa of the positive and negative sequences are used.
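The windowing above can be sketched in a few lines of Python. The function name and its defaults are illustrative assumptions; the window size of 9 and the 70-residue cut-off are the values mentioned on the slide.

```python
def sliding_windows(sequence, size=9, max_length=70):
    """Return all overlapping windows of `size` residues from the
    first `max_length` residues of `sequence`."""
    prefix = sequence[:max_length]
    return [prefix[i:i + size] for i in range(len(prefix) - size + 1)]

seq = "MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPNGSVTTVES"
windows = sliding_windows(seq)
print(windows[0])  # Window 1
print(windows[9])  # Window 10
```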
Graphical output from SignalP
Alternative start codon “prediction”
Symmetric and asymmetric neural network window sizes
SignalP uses two different networks for signal peptide prediction:
• Cleavage site prediction network (C-score)
• Signal peptide vs. non-signal peptide discrimination network (S-score)
An asymmetric window is used for cleavage site prediction, because more information is found upstream of the cleavage site than downstream (see the logo).
A symmetric window is used for discrimination between signal peptide windows and mature protein windows
Neural network windows in SignalP
MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPN
MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPN
Asymmetric window
Symmetric window
Cleavage
Performance calculation
$$\text{Sensitivity} = \frac{tp}{tp + fn}$$

$$\text{Specificity} = \frac{tp}{tp + fp}$$

$$cc = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fn)(tp + fp)(tn + fp)(tn + fn)}}$$

tp: true positive
tn: true negative
fp: false positive
fn: false negative
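The three measures can be computed directly from the four counts, as in the sketch below. The counts in the example are made up for illustration. Note that specificity as defined on this slide, tp/(tp+fp), is what is often called precision elsewhere.

```python
import math

def sensitivity(tp, fn):
    """Fraction of true examples that were predicted."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are correct (the slide's definition)."""
    return tp / (tp + fp)

def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient cc; returns 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 90, 85, 15, 10
sens = sensitivity(tp, fn)
spec = specificity(tp, fp)
cc = correlation_coefficient(tp, tn, fp, fn)
```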
Optimization of window sizes
Optimization of window sizes for SignalP version 3.0
NN window sizes for SignalP 3.0
        Cleavage site network    Discrimination network
        Window    Hidden         Window    Hidden
Euk     19+4      2              27        4
Gram-   11+3      2              19        3
Gram+   21+2      0              19        3
Window sizes used in the final method
An asymmetric window is best for cleavage site prediction, whereas a symmetric window is best for discrimination.
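A rough sketch of how the two window types could be cut out of a sequence is given below. It rests on several assumptions that the slides do not state: that "19+4" means 19 residues upstream plus 4 residues downstream of the candidate cleavage site, that positions outside the sequence are padded with a placeholder character, and that the site index used here is chosen only for illustration. The function names are hypothetical.

```python
def window_at(sequence, position, left, right, pad="X"):
    """Return `left` residues before `position` plus `right` residues from
    `position` onward; positions outside the sequence are filled with `pad`."""
    padded = pad * left + sequence + pad * right
    # Residue i of `sequence` sits at index i + left in `padded`, so the
    # window that starts at i = position - left begins at index `position`.
    return padded[position:position + left + right]

def symmetric_window(sequence, center, size):
    """Window of odd `size` centred on the residue at index `center`."""
    half = size // 2
    return window_at(sequence, center, half, half + 1)

seq = "MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPN"
site = 20  # 0-based index of the first residue after the assumed cleavage site
cleavage_window = window_at(seq, site, 19, 4)            # Euk cleavage net: 19+4
discrimination_window = symmetric_window(seq, site, 27)  # Euk discrimination net: 27
```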
SignalP 3.0 architecture
[Figure: feed-forward network. Input layer (I1, I2, I3, ...), weights, hidden layer (H1, H2, H3, ...), weights, output layer (O1, O2). Inputs: input sequence data, sequence composition, window position.]
In addition to the sequence input, the composition of the entire sequence and the position of the sliding window were used as inputs to the neural network of SignalP 3.0.
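The sketch below shows one way the three kinds of input could be combined into a single input vector and passed through a network with one hidden layer. It is an illustration under assumptions, not the SignalP 3.0 implementation: the sparse (one-hot) residue encoding, the sigmoid units, the weight layout and all names are hypothetical, and the weights in the example are zeros rather than trained values.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Sparse encoding: one input per amino acid; unknown residues give all zeros."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def composition(sequence):
    """Relative frequency of each amino acid over the entire sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

def encode(window, full_sequence, position_value):
    """Concatenate window encoding, sequence composition and position input."""
    inputs = [x for residue in window for x in one_hot(residue)]
    return inputs + composition(full_sequence) + [position_value]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """One hidden layer, one output unit; the last entry of each weight row is a bias."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
              for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden))
                   + output_weights[-1])

seq = "MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPN"
x = encode(seq[0:9], seq, 0.0)
# Untrained placeholder weights: two hidden units, each with a bias term.
hidden_weights = [[0.0] * (len(x) + 1) for _ in range(2)]
output_weights = [0.0] * 3
score = forward(x, hidden_weights, output_weights)
```

With a window of 9 residues the input vector has 9 × 20 sequence inputs, 20 composition inputs and 1 position input.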
Implementation of position neuron
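! LET: position in the sequence; X: input value passed to the NN
! X rises linearly to 1.0 at position RLAV, then falls back to 0.0 at 2*RLAV
      RLAV = 24
      IF (LET .LT. RLAV) THEN
         X = REAL(LET)/REAL(RLAV)
      ELSEIF (REAL(LET) .GT. 2.0*RLAV) THEN
         X = 0.0
      ELSE
         X = 1.0 - ((REAL(LET)-RLAV)/REAL(RLAV))
      ENDIF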
MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHFWRNEYGEYGLAAQK
Fortran code
[Plot: input to NN (0 to 1) versus position in sequence (0 to 48), with the maximum of 1 at position 24]
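For readers who do not use Fortran, the same piecewise-linear function can be written in Python as below. This is a direct port of the snippet on the slide; the function and parameter names are hypothetical, and whether positions are counted from 0 or 1 is not stated in the source.

```python
def position_input(position, peak=24):
    """Position neuron input: rises linearly from 0 to 1 at `peak`,
    falls linearly back to 0 at 2 * peak, and stays 0 beyond that."""
    if position < peak:
        return position / peak
    if position > 2 * peak:
        return 0.0
    return 1.0 - (position - peak) / peak

for pos in (0, 12, 24, 36, 48, 60):
    print(pos, position_input(pos))
```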
Composition of secretory vs. non-secretory proteins
Composition weights
What is new in SignalP version 3.0!
• Data set
  – From SWISS-PROT rel. 40.0
  – Highly curated
  – Cleaned for spurious residues at pos. -1
• Length and composition
  – Improve the performance significantly
  – Length improves both discrimination and cleavage performance
  – Composition improves discrimination
• D-score
  – Average of mean S-score and Y-max score
  – Better discrimination
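The D-score can be sketched as a plain average of the two quantities named above. The slide does not say over which positions the S-score is averaged; the sketch assumes it is averaged over the residues before the position of the maximal Y-score (the predicted signal peptide). The per-position scores in the example are invented.

```python
def d_score(s_scores, y_scores):
    """Average of the mean S-score and the maximal Y-score.

    Assumption: the mean S-score is taken over the positions before the
    position of the maximal Y-score."""
    y_max = max(y_scores)
    cleavage = y_scores.index(y_max)
    signal_part = s_scores[:cleavage]
    s_mean = sum(signal_part) / len(signal_part) if signal_part else 0.0
    return (s_mean + y_max) / 2.0

# Hypothetical per-position scores for a very short sequence.
s = [0.9, 0.9, 0.8, 0.2, 0.1]
y = [0.1, 0.2, 0.3, 0.8, 0.1]
d = d_score(s, y)
```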
Database annotation errors
• Some of the manually curated databases contain obvious errors that can be eliminated
• General "SIGNAL" errors
  – Signal peptide includes propeptide
  – Wrong signal peptide cleavage site
  – The secreted protein is processed by proteases
  – Wrong start codon used
  – Signal peptide of a different class, e.g. TAT or bacteriocin (prokaryote)
Signal peptide or propeptide
[Figure: N-terminus, signal peptide, propeptide, mature protein, with the signal peptide cleavage site and the propeptide cleavage site marked]
Isoelectric point calculations
Improvement by length and composition
Performance of three different SignalP versions
               Cleavage site (Y-score)    Discrimination (SP/non-SP)
Version        Euk     Gram-   Gram+      Euk     Gram-   Gram+
SignalP1 NN    70.2    79.3    67.9       0.97    0.88    0.96
SignalP2 NN    72.4    83.4    67.4       0.97    0.90    0.96
SignalP2 HMM   69.5    81.4    64.5       0.94    0.93    0.96
SignalP3 NN    79.0    92.5    85.0       0.98    0.95    0.98
SignalP3 HMM   75.7    90.2    81.6       0.94    0.94    0.98
SignalP paper now has more than 2500 citations.
Exons and introns: discontinuous protein coding regions in eukaryotes
Two ways to solve the problem
Predict splice sites (GT-donor and AG-acceptor)
or
Predict coding versus non-coding
(at least in non-UTRs)
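To see why the dinucleotides alone are not enough, the sketch below lists every GT and AG in a short stretch of DNA. The function name is hypothetical, and the example sequence is one of the donor site windows shown later in these slides. Every hit is only a candidate; deciding which candidates are real splice sites is the job of the networks.

```python
def dinucleotide_positions(dna, dinucleotide):
    """Return the 0-based start index of every occurrence of `dinucleotide`."""
    return [i for i in range(len(dna) - 1) if dna[i:i + 2] == dinucleotide]

dna = "TACATCTTCTTTAAAGGTAAGGTTGCTCAACCA"
donor_candidates = dinucleotide_positions(dna, "GT")
acceptor_candidates = dinucleotide_positions(dna, "AG")
print(donor_candidates, acceptor_candidates)
```

Even this 33-nucleotide window contains more than one GT and more than one AG, which illustrates how quickly candidate sites accumulate over a whole gene.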
C C T G G A C C G G G T G A
0.12 0.11 0.10
C T G G A C C G G G T G A C
0.12 0.11 0.10 0.14
T G G A C C G G G T G A C G
0.12 0.11 0.10 0.14 0.23
Splice site networks overpredict a lot
Combination of splice site and coding/non-coding networks
1 HUMA1ATP TACATCTTCTTTAAAGGTAAGGTTGCTCAACCA
1 HUMA1ATP CCTGAAGCTCTCCAAGGTGAGATCACCCTGACG
1 HUMACCYBA CCACACCCGCCGCCAGGTAAGCCCGGCCAGCCG
1 HUMACCYBA CGAGAAGATGACCCAGGTGAGTGGCCCGCTACC
1 HUMACTGA GCGCCCCAGACACCAGGTGAGTGGATGGCGCCG
1 HUMACTGA AGAGAAGATGACTCAGGTGAGGCTCGGCCGACG
1 HUMACTGA CACCATGAAGATCAAGGTGAGTCGAGGGGTTGG
1 HUMADAG TCTTATACTATGGCAGGTAAGTCCATACAGAAG
1 HUMALPHA CGTGGCTCTGTCCAAGGTAAGTGCTGGGCTACC
1 HUMALPI CCTGGCTCTGTCCAAGGTAAGGGCTGGGCCACC
1 HUMALPPD TGTGGCTCTGTCCAAGGTAAGTGCTGGGCTACC
1 HUMAPRTA CCTGGAGTACGGGAAGGTAAGAGGGCTGGGGTG
1 HUMCAPG GAAGGCTGCCTTCAAGGTAAGGCATGGGCATTG
1 HUMCFVII GGAGTGTCCATGGCAGGTAAGGCTTCCCCTGGC
1 HUMCP21OH CACCTTGGGCTGCAAGGTGAGAGGCTGATCTCG
1 HUMCP21OHC CACCTTGGGCTGCAAGGTGAGAGGCTGATCTCG
1 HUMCS1 GTGGCAATGGCTCCAGGTAAGCGCCCCTAAAAT
1 HUMCSFGMA AATGTTTGACCTCCAGGTAAGATGCTTCTCTCT
1 HUMCSPB AAAGACTTCCTTTAAGGTAAGACTATGCACCTG
1 HUMCSFGMA AATGTTTGACCTCCAGGTAAGATGCTTCTCTCT
1 HUMCSPB AAAGACTTCCTTTAAGGTAAGACTATGCACCTG
1 HUMCYC1A GCTACGGACACCTCAGGTGAGCGCTGGGCCGGG
...
2 HUMA1ATP CCTGGGACAGTGAATCGTAAGTATGCCTTTCAC
2 HUMA1ATP AAAATGAAGACAGAAGGTGATTCCCCAACCTGA
2 HUMA1GLY2 CGCCACCCTGGACCGGGTGAGTGCCTGGGCTAG
2 HUMA1GLY2 GAGAGTACCAGACCCGGTGAGAGCCCCCATTCC
2 HUMA1GLY2 ACCGTCTCCAGATACGGTGAGGGCCAGCCCTCA
2 HUMA1GLY2 GGGCTGTCTTTCTATGGTAGGCATGCTTAGCAG
2 HUMA1GLY2 CACCGACTGGAAAAAGGTAAACGCAAGGGATTG
2 HUMACCYBA GCGCCCCAGGCACCAGGTAGGGGAGCTGGCTGG
2 HUMACCYBA CAGCCTTCCTTCCTGGGTGAGTGGAGACTGTCT
2 HUMACCYBA CACAATGAAGATCAAGGTGGGTGTCTTTCCTGC
2 HUMACTGA TCGCGTTTCTCTGCCGGTGAGCGCCCCGCCCCG
2 HUMADAG CTTCGACAAGCCCAAAGTGAGCGCGCGCGGGGG
2 HUMADAG TGTCCAGGCCTACCAGGTGGGTCCTGTGAGAAG
2 HUMADAG CGAAGTAGTAAAAGAGGTGAGGGCCTGGGCTGG
...
11 HUMCS1 AACGCAACAGAAATCCGTGAGTGGATGCCGTCT
11 HUMGHN AACACAACAGAAATCCGTGAGTGGATGCCTTCT
52 HUMHSP90B CTCTAATGCTTCTGATGTAGGTGCTCTGGTTTC
80 HUMMETIF1 ACCTCCTGCAAGAAGAGTGAGTGTGAGGCCATC
112 HUMHSP90B ATACCAGAGTATCTCAGTGAGTATCTCCTTGGC
113 HUMHST GCGGACACCCGCGACAGTGAGTGGCGCGGCCAG
113 HUMLACTA GACATCTCCTGTGACAGTGAGTAGCCCCTATAA
151 HUMKAL2 ATCGAACCAGAGGAGTGTACGCCTGGGCCAGAT
157 HUMCS1 CACCTACCAGGAGTTTGTAAGTTCTTGGGGAAT
157 HUMGHN CACCTACCAGGAGTTTGTAAGCTCTTGGGGAAT
164 HUMALPHA CAACATGGACATTGATGTGCGACCCCCGGGCCA
622 HUMCFVII CTGATCGCGGTGCTGGGTGGGTACCACTCTCCC
636 HUMADAG CCTGGAACCAGGCTGAGTGAGTGATGGGCCTGG
895 HUMAPOCIB TCCAGCAAGGATTCAGGTTGTTGAGTGCTTGGG
970 HUMALPHA CGGGCCAAGAAAGCAGGTGGAGCTGGGGCCCGG
2114 HUMAPRTA ATCGACTACATCGCAGGCGAGTGCCAGTGGCCG
Neural network weight analysis: reading frame detection
Exon-intron transition detection units