Hidden unit weights in network model correlations


Compartments in the eukaryotic cell


Protein targeting/localization signals

Cleaved:
• Signal peptide
• Mitochondrial targeting peptide
• Chloroplast targeting peptide
• LPxTG sorting signal
• Peroxisomal targeting signal (PTS2)

Uncleaved:
• Signal anchor
• Nuclear localization signal
• ER/Golgi retention signal
• Peroxisomal targeting signal (PTS1)
• Transmembrane helices


Classical secretory pathway


The secretory signal peptide


Targeting to the ER


Eukaryotic signal peptide logo


Characteristics of signal peptides

        Length  n-region                h-region                           c-region              -3, -1
Euk     22      only slightly Arg-rich  short, very hydrophobic            short, no pattern     small and neutral residues
Gram-   25      Lys+Arg-rich            slightly longer, less hydrophobic  short, Ser+Ala-rich   almost exclusively Ala
Gram+   32      Lys+Arg-rich            very long, less hydrophobic        longer, Thr+Pro-rich  almost exclusively Ala


Prokaryotic signal peptide logos

Gram-positive bacteria

Gram-negative bacteria


Positive and negative training data: secreted versus cytoplasmic and nuclear sequences

 130 YGIW_ECOLI
MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPNGSVTTVESAKSLRDDTWVTLRGNIVERISDDLYVFKD 80
ASGTINVDIDHKRWNGVTVTPKDTVEIQGEVDKDWNSVEIDVKQIRKVNP 160
SSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM------------------------------- 160

 184 PMFA_PROMI
MKLSKIALAAALVFGINSVATAENETPAPKVSSTKGEIQLKGEIVNSACGLAASSSPVIVDFSEIPTSALANLQKAGNIK 80
KDIELQDCDTTVAKTATVSYTPSVVNAVNKDLASFVSGNASGAGIGLMDAGSKAVKWNTATTPVQLINGVSKIPFVAYVQ 160
AESADAKVTPGEFQAVINFQVDYQ 240
SSSSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM------------------------------------------------------------- 160
------------------------

 324 CYSB_KLEAE
MKLQQLRYIVEVVNHNLNVSSTAEGLYTSQPGISKQVRMLEDELGIQIFARSGKHLTQVTPAGQEIIRIAREVLSKVDAI 80
KSVAGEHTWPDKGSLYVATTHTQARYALPGVIKGFIERYPRVSLHMHQGSPTQIAEAVSKGNADFAIATEALHLYDDLVM 160
LPCYHWNRSIVVTPEHPLATKASVSIEELAQYPLVTYTFGFTGRSELDTAFNRAGLTPRIVFTATDADVIKTYVRLGLGV 240
GVIASMAVDPVSDPDLVKLDANGIFSHSTTKIGFRRSTFLRSYMYDFIQRFAPHLTRDVVDTAVALRSNEDIEAMFKDIK 320
LPEK 400
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM------------------------------------------------------------- 160
-------------------------------------------------------------------------------- 240
-------------------------------------------------------------------------------- 320
---- 400

 157 SBMC_ECOLI
MNYEIKQEEKRTVAGFHLVGPWEQTVKKGFEQLMMWVDSKNIVPKEWVAVYYDNPDETPAEKLRCDTVVTVPGYFTLPEN 80
SEGVILTEITGGQYAVAVARVVGDDFAKPWYQFFNSLLQDSAYEMLPKPCFEVYLNNGAEDGYWDIEMYVAVQPKHH 160
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 80
MMMMMMMMMMMMMMMMMMM---------------------------------------------------------- 160


Data partitioning for training and test

Remove highly similar sequences from the data set, i.e. those for which cleavage site information can be reliably transferred by alignment.

The redundancy-reduced data set can then be used for, say, five-fold cross-validation.

Ideally, the training set contains equal numbers of positive and negative example sequences.
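The partitioning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: redundancy reduction is taken to have been done already, and the function names `k_fold_splits` and `balance` are hypothetical, not part of SignalP.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def balance(positives, negatives, seed=0):
    """Downsample the larger class so both classes have the same size."""
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```

Each sequence ends up in the test set of exactly one fold, so every example is used for testing once and for training four times.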

Training

Test


Sliding window

Sequence: MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPNGSVTTVES ...

Window size here is 9 (example)

Window 1: MAKFAAVIA
Window 2: AKFAAVIAV
Window 3: KFAAVIAVM
Window 4: FAAVIAVMA
...
Window 10: VMALCSAPV
...

For signal peptide prediction, typically only the first 70 aa of the positive and negative sequences are used.
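The windowing on this slide can be reproduced with a short Python sketch. The function name `sliding_windows` and its parameters are hypothetical; the window size of 9 and the 70-residue limit are taken from the slide.

```python
def sliding_windows(sequence, size=9, max_len=70):
    """Return all overlapping windows of `size` residues from the first `max_len` residues."""
    region = sequence[:max_len]
    return [region[i:i + size] for i in range(len(region) - size + 1)]

seq = "MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPNGSVTTVES"
windows = sliding_windows(seq)
print(windows[0])  # Window 1 on the slide
print(windows[9])  # Window 10 on the slide
```

Each window would then be presented to the network as one input example, with the window moving one residue at a time.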


Graphical output from SignalP


Alternative start codon “prediction”


Symmetric and asymmetric neural network window sizes

SignalP uses two different networks for signal peptide prediction:

• Cleavage site prediction network (C-score)
• Signal peptide vs. non-signal peptide discrimination network (S-score)

An asymmetric window is used for cleavage site prediction, because more information is found upstream of the cleavage site (see logo).

A symmetric window is used for discrimination between signal peptide windows and mature protein windows.


Neural network windows in SignalP

MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPN

MAKFAAVIAVMALCSAPVMAAEQGGFSGPSATQSQAGGFQGPN

Asymmetric window

Symmetric window

Cleavage


Performance calculation

$$\text{Sensitivity} = \frac{tp}{tp + fn}$$

$$\text{Specificity} = \frac{tp}{tp + fp}$$

$$cc = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fn)(tn + fp)(tp + fp)(tn + fn)}}$$

tp: true positive
tn: true negative
fp: false positive
fn: false negative
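The three performance measures on this slide can be computed from the four counts as in the sketch below. The function names are hypothetical. Note that specificity as defined here, tp / (tp + fp), is what is often called precision elsewhere; the zero-denominator fallback for cc is a convention chosen for this example.

```python
import math

def sensitivity(tp, fn):
    """Fraction of true examples that are predicted: tp / (tp + fn)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are correct, as defined on the slide: tp / (tp + fp)."""
    return tp / (tp + fp)

def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient cc; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike sensitivity and specificity, cc uses all four counts, so it cannot be inflated simply by predicting everything as positive or everything as negative.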


Optimization of window sizes

Optimization of window sizes for SignalP version 3.0


NN window sizes for SignalP 3.0

        Cleavage site network    Discrimination network
        Window    Hidden         Window    Hidden
Euk     19+4      2              27        4
Gram-   11+3      2              19        3
Gram+   21+2      0              19        3

Window sizes used in the final method

An asymmetric window is best for cleavage site prediction, whereas a symmetric window is best for discrimination.
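The difference between the two window types can be illustrated with a small Python sketch. It rests on assumptions the slides do not state: that "19+4" means 19 positions upstream and 4 positions downstream of the cleavage site, and that positions outside the sequence are padded with a placeholder character ("X" here). The function names are hypothetical.

```python
def padded_slice(sequence, start, stop, pad="X"):
    """Return sequence[start:stop], padding positions that fall outside the sequence."""
    return "".join(
        sequence[i] if 0 <= i < len(sequence) else pad
        for i in range(start, stop)
    )

def asymmetric_window(sequence, site, upstream=19, downstream=4):
    """Window around a candidate cleavage site.

    `site` is the 0-based index of the first residue after the cleavage site.
    """
    return padded_slice(sequence, site - upstream, site + downstream)

def symmetric_window(sequence, position, size=27):
    """Window of odd length `size` centred on `position` (0-based)."""
    half = size // 2
    return padded_slice(sequence, position - half, position + half + 1)
```

For the example sequence MAKFAAVIAVMALCSAPVMAAEQG..., `asymmetric_window(seq, 20)` covers residues 2 to 20 of the signal peptide plus the first four residues of the mature protein.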


SignalP 3.0 architecture

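Figure: feed-forward neural network. An input layer (I1, I2, I3, ...) is connected by weights to a hidden layer (H1, H2, H3, ...), which is connected by weights to an output layer (O1, O2). Besides the input sequence data, the network has additional inputs for sequence composition and window position.

In addition to the sequence input, the composition of the entire sequence and the position of the sliding window were used as inputs to the neural network of SignalP 3.0.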
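As an illustration of the composition input, the sketch below computes the amino acid composition of an entire sequence as 20 fractions. How SignalP 3.0 actually encodes and scales this input is not stated in the slides, so the alphabet order, the name `composition`, and the handling of non-standard residues are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by one-letter code

def composition(sequence):
    """Fraction of each standard amino acid over the entire sequence.

    Non-standard characters are ignored when computing the fractions.
    """
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]
```

Because the composition is computed once per sequence, the same 20 values would accompany every window taken from that sequence.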


Implementation of position neuron

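Fortran code:

    ! LET is taken to be the position in the sequence, X the input to the NN
    RLAV = 24
    IF (LET .LT. RLAV) THEN
       ! rises linearly from 0 towards 1 up to position RLAV
       X = REAL(LET)/REAL(RLAV)
    ELSEIF (REAL(LET) .GT. 2.0*RLAV) THEN
       ! zero beyond position 2*RLAV
       X = 0.0
    ELSE
       ! falls linearly from 1 back to 0 between RLAV and 2*RLAV
       X = 1.0 - ((REAL(LET)-RLAV)/REAL(RLAV))
    ENDIF

Figure: input to the NN (y-axis, 0 to 1) plotted against position in sequence (x-axis, 0 to 48), peaking at 1 at position 24, shown for the example sequence:

MKLLQRGVALALLTTFTLASETALAYEQDKTYKITVLHTNDHHGHFWRNEYGEYGLAAQK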
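For readers less familiar with Fortran, the position input can be sketched in Python. This is a direct port of the Fortran fragment on this slide, under the assumption that LET is the position in the sequence; the function name `position_input` is hypothetical.

```python
def position_input(let, rlav=24):
    """Triangular position signal: rises to 1 at `rlav`, falls to 0 at 2 * rlav."""
    if let < rlav:
        return let / rlav
    elif let > 2 * rlav:
        return 0.0
    else:
        return 1.0 - (let - rlav) / rlav

for pos in (0, 12, 24, 36, 48, 60):
    print(pos, position_input(pos))
```

The signal is largest around position 24, which is close to the typical signal peptide lengths listed in the table of signal peptide characteristics (22 to 32 residues).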


Composition of secretory vs. non-secretory proteins


Composition weights


What is new in SignalP version 3.0!

• Data set
  – From SWISS-PROT rel. 40.0
  – Highly curated
  – Cleaned for spurious residues at pos. -1

• Length and composition
  – Improves the performance significantly
  – Length improves both discrimination and cleavage performance
  – Composition improves discrimination

• D-score
  – Average of mean-S score and Y-max score
  – Better discrimination
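The D-score definition amounts to a one-line calculation, sketched here in Python. The function names are hypothetical, and the cutoff is left as a required argument because the slides do not give a threshold value.

```python
def d_score(mean_s, y_max):
    """D-score: average of the mean S-score and the maximal Y-score."""
    return (mean_s + y_max) / 2.0

def predicted_signal_peptide(mean_s, y_max, cutoff):
    """Classify as signal peptide when the D-score exceeds a caller-supplied cutoff."""
    return d_score(mean_s, y_max) > cutoff
```

Combining the two scores means a sequence needs both a signal-peptide-like region (mean S) and a plausible cleavage site (Y-max) to score highly.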


Database annotation errors

• Some of the manually curated databases contain obvious errors that can be eliminated

• General "SIGNAL" errors
  – Signal peptide includes propeptide
  – Wrong signal peptide cleavage site
  – The secreted protein is processed by proteases
  – Wrong start codon used
  – Signal peptide of a different class, e.g. TAT or bacteriocin (prokaryote)


Signal peptide or propeptide

N - Signal peptide - Propeptide - Mature protein


Signal peptide or propeptide

Propeptide cleavage

Signal peptide cleavage


Isoelectric point calculations


Improvement by length and composition


Performance of three different SignalP versions

Version         Cleavage site (Y-score)    Discrimination (SP/non-SP)
                Euk    Gram-   Gram+       Euk    Gram-   Gram+
SignalP1 NN     70.2   79.3    67.9        0.97   0.88    0.96
SignalP2 NN     72.4   83.4    67.4        0.97   0.90    0.96
SignalP2 HMM    69.5   81.4    64.5        0.94   0.93    0.96
SignalP3 NN     79.0   92.5    85.0        0.98   0.95    0.98
SignalP3 HMM    75.7   90.2    81.6        0.94   0.94    0.98

SignalP paper now has more than 2500 citations.


Exons and introns: discontinuous protein coding regions in eukaryotes


Two ways to solve the problem

Predict splice sites (GT-donor and AG-acceptor)

or

Predict coding versus non-coding

(at least in non-UTRs)
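To see why the dinucleotides alone are not enough, the sketch below lists every GT and AG in a sequence as a candidate donor or acceptor site. This is only an illustration, not the network-based method of the slides; the function name `candidate_splice_sites` is hypothetical, and the example sequence is the first HUMA1ATP donor site entry from the list on a later slide.

```python
def candidate_splice_sites(dna):
    """Return 0-based start positions of every GT (donor) and AG (acceptor) dinucleotide."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors

donors, acceptors = candidate_splice_sites("TACATCTTCTTTAAAGGTAAGGTTGCTCAACCA")
print(donors)     # positions of GT
print(acceptors)  # positions of AG
```

Even this 33-nucleotide window contains two GT and two AG dinucleotides, so a predictor has to use the surrounding sequence context to tell real splice sites from chance occurrences.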


C C T G G A C C G G G T G A

0.12 0.11 0.10


C T G G A C C G G G T G A C

0.12 0.11 0.10 0.14


T G G A C C G G G T G A C G

0.12 0.11 0.10 0.14 0.23


Splice site networks overpredict a lot


Combination of splice site and coding/non-coding networks



   1 HUMA1ATP   TACATCTTCTTTAAAGGTAAGGTTGCTCAACCA
   1 HUMA1ATP   CCTGAAGCTCTCCAAGGTGAGATCACCCTGACG
   1 HUMACCYBA  CCACACCCGCCGCCAGGTAAGCCCGGCCAGCCG
   1 HUMACCYBA  CGAGAAGATGACCCAGGTGAGTGGCCCGCTACC
   1 HUMACTGA   GCGCCCCAGACACCAGGTGAGTGGATGGCGCCG
   1 HUMACTGA   AGAGAAGATGACTCAGGTGAGGCTCGGCCGACG
   1 HUMACTGA   CACCATGAAGATCAAGGTGAGTCGAGGGGTTGG
   1 HUMADAG    TCTTATACTATGGCAGGTAAGTCCATACAGAAG
   1 HUMALPHA   CGTGGCTCTGTCCAAGGTAAGTGCTGGGCTACC
   1 HUMALPI    CCTGGCTCTGTCCAAGGTAAGGGCTGGGCCACC
   1 HUMALPPD   TGTGGCTCTGTCCAAGGTAAGTGCTGGGCTACC
   1 HUMAPRTA   CCTGGAGTACGGGAAGGTAAGAGGGCTGGGGTG
   1 HUMCAPG    GAAGGCTGCCTTCAAGGTAAGGCATGGGCATTG
   1 HUMCFVII   GGAGTGTCCATGGCAGGTAAGGCTTCCCCTGGC
   1 HUMCP21OH  CACCTTGGGCTGCAAGGTGAGAGGCTGATCTCG
   1 HUMCP21OHC CACCTTGGGCTGCAAGGTGAGAGGCTGATCTCG
   1 HUMCS1     GTGGCAATGGCTCCAGGTAAGCGCCCCTAAAAT
   1 HUMCSFGMA  AATGTTTGACCTCCAGGTAAGATGCTTCTCTCT
   1 HUMCSPB    AAAGACTTCCTTTAAGGTAAGACTATGCACCTG
   1 HUMCSFGMA  AATGTTTGACCTCCAGGTAAGATGCTTCTCTCT
   1 HUMCSPB    AAAGACTTCCTTTAAGGTAAGACTATGCACCTG
   1 HUMCYC1A   GCTACGGACACCTCAGGTGAGCGCTGGGCCGGG
...
   2 HUMA1ATP   CCTGGGACAGTGAATCGTAAGTATGCCTTTCAC
   2 HUMA1ATP   AAAATGAAGACAGAAGGTGATTCCCCAACCTGA
   2 HUMA1GLY2  CGCCACCCTGGACCGGGTGAGTGCCTGGGCTAG
   2 HUMA1GLY2  GAGAGTACCAGACCCGGTGAGAGCCCCCATTCC
   2 HUMA1GLY2  ACCGTCTCCAGATACGGTGAGGGCCAGCCCTCA
   2 HUMA1GLY2  GGGCTGTCTTTCTATGGTAGGCATGCTTAGCAG
   2 HUMA1GLY2  CACCGACTGGAAAAAGGTAAACGCAAGGGATTG
   2 HUMACCYBA  GCGCCCCAGGCACCAGGTAGGGGAGCTGGCTGG
   2 HUMACCYBA  CAGCCTTCCTTCCTGGGTGAGTGGAGACTGTCT
   2 HUMACCYBA  CACAATGAAGATCAAGGTGGGTGTCTTTCCTGC
   2 HUMACTGA   TCGCGTTTCTCTGCCGGTGAGCGCCCCGCCCCG
   2 HUMADAG    CTTCGACAAGCCCAAAGTGAGCGCGCGCGGGGG
   2 HUMADAG    TGTCCAGGCCTACCAGGTGGGTCCTGTGAGAAG
   2 HUMADAG    CGAAGTAGTAAAAGAGGTGAGGGCCTGGGCTGG
...
  11 HUMCS1     AACGCAACAGAAATCCGTGAGTGGATGCCGTCT
  11 HUMGHN     AACACAACAGAAATCCGTGAGTGGATGCCTTCT
  52 HUMHSP90B  CTCTAATGCTTCTGATGTAGGTGCTCTGGTTTC
  80 HUMMETIF1  ACCTCCTGCAAGAAGAGTGAGTGTGAGGCCATC
 112 HUMHSP90B  ATACCAGAGTATCTCAGTGAGTATCTCCTTGGC
 113 HUMHST     GCGGACACCCGCGACAGTGAGTGGCGCGGCCAG
 113 HUMLACTA   GACATCTCCTGTGACAGTGAGTAGCCCCTATAA
 151 HUMKAL2    ATCGAACCAGAGGAGTGTACGCCTGGGCCAGAT
 157 HUMCS1     CACCTACCAGGAGTTTGTAAGTTCTTGGGGAAT
 157 HUMGHN     CACCTACCAGGAGTTTGTAAGCTCTTGGGGAAT
 164 HUMALPHA   CAACATGGACATTGATGTGCGACCCCCGGGCCA
 622 HUMCFVII   CTGATCGCGGTGCTGGGTGGGTACCACTCTCCC
 636 HUMADAG    CCTGGAACCAGGCTGAGTGAGTGATGGGCCTGG
 895 HUMAPOCIB  TCCAGCAAGGATTCAGGTTGTTGAGTGCTTGGG
 970 HUMALPHA   CGGGCCAAGAAAGCAGGTGGAGCTGGGGCCCGG
2114 HUMAPRTA   ATCGACTACATCGCAGGCGAGTGCCAGTGGCCG


Neural network weight analysis: reading frame detection


Exon-intron transition detection units