27
GPS Manual Group-based Prediction System Version 2.1.2 12/09/2012 Author: Yu Xue & Jian Ren Contact: Dr. Yu Xue, [email protected] ; Dr. Jian Ren, [email protected] The software is only free for academic research. The latest version of GPS software is available from http://gps.biocuckoo.org Copyright (c) 2004-2012. The CUCKOO Workgroup. All Rights Reserved.

GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

Group-based Prediction System Version 2.1.2 12/09/2012

Author: Yu Xue & Jian Ren

Contact: Dr. Yu Xue, [email protected] ; Dr. Jian Ren, [email protected] The software is only free for academic research.

The latest version of GPS software is available from http://gps.biocuckoo.org Copyright (c) 2004-2012. The CUCKOO Workgroup. All Rights Reserved.

Page 2: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

1

Index

INDEX ..................................................................................................................................................... 1

STATEMENT.......................................................................................................................................... 1

INTRODUCTION .................................................................................................................................. 1

DOWNLOAD & INSTALLATION ...................................................................................................... 2

PREDICTION OF KINASE-SPECIFIC PHOSPHORYLATION SITES......................................... 4

DIRECT PREDICTION .............................................................................................................................. 4 BATCH PREDICTION ............................................................................................................................... 7

ALGORITHMS AND PREDICTION PERFORMANCE ................................................................ 11

ALGORITHM DESIGN IN GPS 2.0 ......................................................................................................... 11 ALGORITHM UPDATION IN GPS 2.1 ..................................................................................................... 12 EVALUATION OF PREDICTION PERFORMANCES .................................................................................... 14

ESTIMATION OF FALSE POSITIVE PREDICTIONS AND CUT-OFF SETTING .................... 18

AN APPLICATION OF GPS 2.0: A LARGE-SCALE PREDICTION OF MAMMALIAN KINASE-SPECIFIC PHOSPHORYLATION SITES ....................................................................... 20

EXISTED RESOURCES AND ONLINE SERVICES FOR PHOSPHORYLATION .................... 22

REFERENCES ..................................................................................................................................... 23

RELEASE NOTE ................................................................................................................................. 24

Page 3: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

1

Statement

1. Implementation. The softwares of the CUCKOO Workgroup are implemented in JAVA (J2SE). Usually, both of online service and local stand-alone packages will be provided. 2. Availability. Our softwares are freely available for academic researches. For non-profit users, you can copy, distribute and use the softwares for your scientific studies. Our softwares are not free for commercial usage. 3. GPS. Previously, we used the GPS to denote our Group-based Phosphorylation Scoring algorithm. Currently, we are developing an integrated computational platform for post-translational modifications (PTMs) of proteins. We re-denote the GPS as Group-based Prediction Systems. This software is an indispensable part of GPS. 4. Usage. Our softwares are designed in an easy-to-use manner. Also, we invite you to read the manual before using the softwares. 5. Updation. Our softwares will be updated routinely based on users’ suggestions and advices. Thus, your feedback is greatly important for our future updation. Please do not hesitate to contact with us if you have any concerns. 6. Citation. Usually, the latest published articles will be shown on the software websites. We wish you could cite the article if the software has been helpful for your work. 7. Acknowledgements. The work of CUCKOO Workgroup is supported by grants from Chinese 973 project (2006CB933300, 2007CB914503, and 2007CB947401), Chinese Natural Science Foundation (30700138, 30830036, and 30721002), Chinese Academy of Sciences (KSCX1-YW-R65, KSCX2-YW-21, KSCX2-YW-R-139, and KJCX2-YW-M02), the Cultivation Fund of the Ministry of Education of China (NO. 706035), and China Postdoctoral Science Foundation (20080430100).

Page 4: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

1

Introduction

Identification of phosphorylation sites with their cognate protein kinases (PKs) is the foundation for understanding the functional dynamics and plasticity of various cellular processes. Although nearly 10 kinase-specific predictors were developed, numerous PKs were casually classified into sub-groups without a standard rule. And for large-scale predictions, the false positive rate (FPR) was also never addressed. Here we adopted a well-established rule to classify PKs with their verified sites into a hierarchical structure with four levels, including group, family, subfamily and single PK 1. Then we constructed the GPS (Group-based Prediction System, ver 2.0) software, with a modified version of GPS (Group-based Phosphorylation Scoring) algorithm 2,3. As the first stand-alone software for computational phosphorylation, GPS 2.0 was implemented in JAVA and could predict kinase-specific phosphorylation sites for 408 human PKs in hierarchy. Also, we developed a simple approach to estimate the theoretically maximal FPRs. As an application of GPS 2.0, we performed a large-scale prediction of more than 13,000 mammalian phosphorylation sites with high performances. Totally, there were 12,219 sites predicted with at least one PK, with coverage of 92.19%. In addition, we also provided a proteome-wide prediction of Aurora-B specific substrates including protein-protein interaction information. Our analysis will be a solid foundation for further experimental consideration and construction of phosphorylation networks. On Dec. 25th, 2008, we released GPS 2.1 with a novel Peptide Selection method to improve the prediction performances and robustness greatly. The newly released GPS software is only freely available for academic researches at: http://gps.biocuckoo.org.

Group-based Prediction System v2.1 User Interface

Page 5: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

2

Download & Installation

Most of the contents in manual will not be changed in GPS 2.x software. Thus, we used GPS 2.0 as the example to write the manual for you. The software of GPS 2.x was implemented in JAVA, and could be installed on Windows, Mac OS X or Linux systems. GPS 2.x distributions could be found at http://gps.biocuckoo.org/down.php. We recommend that users could download the latest release. After downloading, please double-click on the install package to begin installation. Follow the user prompts through the installation. And snapshots of the setup program are shown below:

Page 6: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

3

Click on the Finish button to complete the setup program.

Page 7: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

4

Prediction of Kinase-specific Phosphorylation

Sites

Direct Prediction

For convenience, the GPS 2.x allows users to input their protein sequences into the “TEXT form” for prediction. One or more protein sequences should be prepared in FATSA format as below: >protein1 XXXXXXXXXXXXX XXXXXXXX >protein2 XXXXXXXXXXXXXXXX… >protein3 XXXXXXXXXXXX ... Please note: All irregular words, including non-amino acid word (e.g., number) and blank, will be removed automatically. As an instance, we put rat spinophilin protein sequence as an example for GPS 2.x. Users could click on the “Example” button to access the example.

Page 8: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

5

Choose one or more kinases from the Kinase Hierarchy Tree

Choose a Threshold what you need, default is Medium.

Page 9: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

6

Click on the Submit button, then the predicted phosphorylation sites will be shown.

Then please click on the RIGHT mouse button in the form.

Page 10: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

7

You can use the “Select All” and “Copy Selected” to copy the selected results into Clipboard. Then please copy the results into a file, e.g., an EXCEL file for further consideration. Also, you can choose “Export Result” to export the results into a tab-delimited text file. If you choose the Visualization function, the given protein and its predicted sites will be visualized with DOG (Domain Graph, Version 1.0), an illustrator of protein domain structures.

Batch Prediction

We also provide an alternative approach for processing multiple protein sequences. If the file is large, the Batch Predictor will be convenient for users. The following steps show you how to use it: Put protein sequences into a file with FATSA format as below: >protein1 XXXXXXXXXXXXX XXXXXXXX >protein2 XXXXXXXXXXXXXXXX… >protein3 XXXXXXXXXXXX ... The names of proteins are necessary (the line with “>” and a protein name/accession number).

Page 11: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

8

To run the Batch Predictor just select the Batch Predictor option in the Tools menu.

Click on the Add File button and add one or more protein sequence files in your hard disk.

Page 12: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

9

The name of added files will be shown in the Sequence File List

The output directory of prediction results should also be defined. Please click on the “>>” button to specify the export file fold.

Page 13: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

10

5. Choose one or more kinases from Kinase Hierarchy Tree, and then pick a Threshold, click on the Submit button, then the Batch Predictor will begin to process all of the sequence files that have been added to the list. The results of predictions will be exported to the Prediction Export Fold, and the name of result files will be shown in the Prediction File List.

Page 14: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

11

Algorithms and Prediction Performance

Algorithm Design in GPS 2.0

To predict kinase-specific phosphorylation sites, we employed our previous GPS (Group-based Phosphorylation Scoring) method with improvement 2,3. The basic hypothesis of GPS algorithm is that if two short peptides are highly similar, they might also bear similar 3D structures and biochemical properties. Then we used the amino acid substitution matrix BLOSUM62 to calculate the similarity between two PSP (7, 7) peptides. As previously described 2,3, for two amino acids a and b, let the substitution score between them in BLOSUM62 be Score(a, b). The similarity between two PSP (7, 7) peptides (15aa) A and B is defined as:

∑≤≤

=151

])[],[(),(i

iBiAScoreBAS

If S(A, B)<0, we simply redefine it as S(A, B)=0. Given a putative PSP (7, 7) peptide, it will be compared with all known sites pairwisely to calculate the substitution scores, separately. The average value of the substitution scores is computed as the final prediction score of the given site. Given a putative PSP (7,7) peptide, we can calculate its GPS score. Then we can judge whether the given site is a potentially real phosphorylation site, under different thresholds. In previous versions (GPS 1.0 and 1.10), we hypothesized that the bona fide pattern for PK recognition and modification might be compromised by heterogeneity of multiple structural determinants with different features. Then all known phosphorylation sites are automatically partitioned into several clusters with the Markov Cluster Algorithm (MCL for short) to improve the prediction performance 2,3. However, only ~11% of the PK groups (8 out of 71) could be divided into more than one cluster with enhanced performances. Thus, the clustering method was not used in GPS 2.0.

Page 15: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

12

To improve the robustness of the prediction system globally without influencing the prediction performance significantly, we employed an alternative approach of matrix mutation.

Firstly, the BLOSUM62 was chosen as the initial matrix. The performance (Sn and Sp) of leave-one-out validation for each PK groups was calculated. Then we fixed Sp as 90% to improve Sn by matrix mutation. Although matrix mutation in other types was also valid, the method we used in this article could improve the robustness significantly.

Algorithm Updation in GPS 2.1

Recently, we found that the longer PSP(m, n) always enhanced the self-consistency performance, but the leave-one-out or n-fold cross-validation performances will first reach a maximum value and then decreased. The self-consistency performance represented the computational power of the prediction system to training data set, while the leave-one-out validation denote the robustness and stability of the software. Usually, the self-consistency performance is higher than the leave-one-out validation. However, if two performances are different greatly, the software is over-fitting with poor prediction ability to unknown data. Thus, an important and interesting question emerged as whether we can found an optimized PSP(m, n) with the highest leave-one-out performance, without influence or slightly reducing the self-consistency performance? To address the problem, we fixed Sp as 85%, 90%, and 95% to calculate the Sn value with PSP (m, n) (m=1, …, 15; n=1, …, 15), respectively. Then the final Sn value was calculated as the average Sn under Sp of 85%, 90%, and 95%. Finally, the optimized PSP (m, n) for each PK groups were decided with their maximal Sn values. We also tested

The following pseudocode show you how to get a mutation matrix:

Initialize matrix with BLOSUM62 Set the default mutation times to 0 While mutation times less than 10000 Pick an element of matrix at random to mutate Increase or decrease the value of the element Calculate score of Leave-one-out with the mutated matrix If the score increase Keep the forward mutation Else Give up the mutation Endif Score the number of mutation times Endwhile Return the mutated matrix

Page 16: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

13

many other methods to find the optimized PSP (m, n), and found this method will be more efficient to improve the prediction robustness greatly. In GPS 2.1, the leave-one-out validations were averagely increased with ~15.80%, while the self-consistency performances were slightly and a averagely decreased with ~2.38%. Taken together, we greatly reduced the difference between the leave-one-out validation and the self-consistency performance, to make GPS system more robust and stable for prediction of kinase-specific phosphorylation sites.

Page 17: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

14

Evaluation of Prediction Performances

Performance measurements To evaluate the prediction performances, four standard measurements were used, including accuracy (Ac), sensitivity (Sn), specificity (Sp) and Mathew correlation coefficient (MCC). Accuracy (Ac) represents the correct ratio between both positive (+) and negative (-) data sets, while sensitivity (Sn) and specificity (Sp) illustrate the correct prediction ratios of positive (+) and negative data (-) sets respectively. Since the number of positive data and negative data differed too much from each other, the Mathew correlation coefficient (MCC) was also included. The value of MCC ranges from -1 to 1, and a larger MCC value stands for better prediction performance. Among the data with positive hits by GPS 2.x, the real positives were defined as true positives (TP), while the others were defined as false positives (FP). Among the data with negative predictions, the real positives were defined as false negatives (FN), while the others were defined as true negatives (TN). The four measurements of sensitivity (Sn), specificity (Sp), accuracy (Ac), and Mathew correlation coefficient (MCC) were defined as below:

FNTPTPSn+

= , FPTN

TNSp+

= ,

FNTNFPTPTNTPAc

++++

= , and

)()()()()()(

FNTNFPTPFPTNFNTPFPFNTNTPMCC

+×+×+×+×−×

= .

The self-consistency, leave-one-out validation and n-fold cross-validation The self-consistency used the training positive data and negative data directly to evaluate the prediction performance, and represented the computational power of the prediction system. However, the robustness and stability of the software should also be evaluated by leave-one-out validation and n-fold cross-validation. In the leave-one-out validation, which is also called as Jack-Knife validation, each sites in the dataset was picked out in turn as an independent test sample, and all the remaining sites were regarded as training data. This process was repeated until each site was used as test data one time. In n-fold cross-validation all the (+) sites and (-) sites were combined and then divided equally into n parts, keeping the same distribution of (+) and (-) sites in each part. Then n-1 parts were merged into a training data set while the remanent part was taken as a testing data set. This process was repeated twenty times and the average performance of n-fold cross- validation was used to estimate the performance. In this work, the performances of self-consistency and leave-one-out validation were calculated for all PK groups. And the

Page 18: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

15

4-, 6-, 8-, 10-fold cross-validations were performed for 70 PK groups without less than 30 sites. All self-consistency performances of GPS 2.x have been included in Performance option in the Tools menu.

Accessing of prediction performances of GPS 2.x with related information To check the prediction performance of the 408 human PKs just select the Performance option in the Tools menu. Choose a kinase what you want check from the Kinase Hierarchy Tree.

Page 19: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

16

Then Kinase Information and Prediction Performance of the kinase are shown in the tables.

If you want get more kinase information, you can click on the hyperlinks in the table.

Page 20: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

17

The hyperlinks will access the UniProt database and show you the detailed information of the kinases.

Page 21: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

18

Estimation of False Positive Predictions and

Cut-off setting

Estimation and control of false positive prediction is the key point in large-scale predictions of kinase-specific phosphorylation sites. The false positive rate (FPR) is the proportion of negative sites that are erroneously predicted as positive hits. From our analysis, the real phosphorylation sites were only a very small part of all S/T residues in proteins. For 144 serine/threonine PK groups, the positive sites vs. the negative sites was from 1:13.2 (Other/PEK, 16 positive sites and 211 negative sites) to 1:141.2 (CAMK/CAMK1/CAMK4, 9 positive sites with 1271 negative sites), with the average number of 49.0. And for 69 tyrosine PK groups, the positive sites vs. the negative sites was from 1:1.6 (TK/Trk/TRKA, 5 positive sites with 8 negative sites) to 1:28.2 (TK/Csk, 5 positive sites and 141 negative sites), with the average number of 9.7. Thus, even a very small FPR will generate too many false positive hits. Given a data set containing all of non-phosphorylation sites, the real FPR could be easily computed. However, precise calculation of FPR is unavailable due to lack of a “gold-standard” negative data set. In order to estimate the false positive rate (FPR), we tried to construct a near-negative data set by several approaches. The first method was to generate PSP(7,7) peptides randomly. However, the frequencies of the twenty amino acids are not equal in eukaryotes. Thus, the method was not used since it could not reflect the real distributions of PSP(7,7) peptides in proteomes. Also, the negative sites could also be randomly retrieved from eukaryotic proteomes. However, this method need a large sequence file to retrieve PSP(7,7) peptides with slow speed. In this article, we chose a simple and fast method to construct the near-negative data set. Firstly, we calculated the distributions of amino acids composition in six organisms, including S. cerevisiae, S. pombe, C. elegans, D. melanogaster, M. musculus, and H. sapiens. Then we randomly generated PSP(7,7) peptides based on the real frequencies of twenty amino acids. And FPR values based on the latter two methods were very similar. By this method, we randomly generated 10,000 PSP(7,7) peptides and used GPS 2.x to estimate the theoretically maximal FPR. The process was repeated twenty times and the average value was calculated as the final FPR. Then for large-scale predictions, we defined the precision (Pr) as below:

.Pr*.PrPr

eFPRNe −

=

Here, N is the number of sites (S/T or Y) for prediction; Pre. is the number of predicted sites by GPS 2.x. Since the FPR is the theoretically maximal false positive hits, the precision (Pr) is the minimal proportion of correct predictions.

Page 22: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

19

Threshold setting was also a difficult problem. By experiences, we and others chose different thresholds for every PK groups2-12. Here we proposed that a uniform rule to choose cut-off values based on calculated FPRs. For serine/threonine kinases, the high, medium and low thresholds were established with FPRs of 2%, 6% and 10%. And for tyrosine kinases, the high, medium and low thresholds were selected with FPRs of 4%, 9% and 15%. Such thresholds will not be changed for GPS 2.x softwares. The high threshold was validated by a large-scale prediction of mammalian phosphorylation sites, with satisfying performance. And the medium threshold relaxed the stringency to be useful in small-scale experiments. Also, the low threshold reduced the Sp to improve Sn considerably to be useful in exhaustively experimental identifying all potential phosphorylation sites in substrates.

Page 23: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

20

An application of GPS 2.0: A large-scale

prediction of mammalian kinase-specific

phosphorylation sites

As an application of GPS 2.0, we performed a large-scale prediction of kinase-specific phosphorylation sites in mammalians. The high threshold of GPS 2.0 was chosen, with a FPR of 2% for serine/threonine kinases and 4% for tyrosine kinases. The predictor for budding yeast IPL1 was not used. From Phospho.ELM 6.0, there were 13,254 mammalian sites, including 9,717 S, 1,818 T and 1,719 Y sites (see Table 1). Table 1 - Data statistics of a large-scale prediction for mammalian kinase-specific sites. From Phospho.ELM 6.0, there were 13,250 sites identified in mammalians. And GPS 2.0 could predict 12,403 sites with at least one cognate kinase.

Phospho.ELM 6.0 Mammalian Predicted Coverage Total Sites 13254 12219 92.19% Pro. 4291 4071 94.87% S Sites 9717 9195 94.63% Pro. 3444 3325 96.54% T Sites 1818 1551 85.31% Pro. 1200 1048 87.33% Y Sites 1719 1473 85.69% Pro. 885 768 86.78%

These sites were experimentally identified, but the kinase information of more than 10,000 sites still remained to be annotated. We divided the data set into three groups, the known substrates of a PK for prediction (Known sub.), the known substrates of other kinases (Other’s sub), and the sites without PK information (Unknown sub.). For example, there were 306 sites experimentally identified as PKA sites in mammalians. And 1,993 sites were verified as substrates of other PKs, with 9,236 un-annotated sites. For the first group (Known sub.), the sensitivity (Sn) was calculated to depict the proportion of which we can correctly predict for the existed sites. And for the latter two groups, the precision (Pr) was calculated to estimate the minimal accuracy for large-scale predictions, respectively. For 143 serine/threonine and 69 tyrosine PK groups, the Sn values for known substrates and Pr values for unknown data were calculated, respectively. Most of the prediction results obtained satisfying performances. For example, GPS 2.0 could predict 200 of 306 known PKA sites as positive hits, with a Sn of 65.36%. And for 1,993 sites

Page 24: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

21

phosphorylated by other PKs, GPS 2.0 could predict 220 of them as positive hits, with a Pr of 81.88%, which meant at least 81.88% of the 220 predicted sites might be positive sites. Again, for 9,236 un-annotated sites, GPS 2.0 could predict 959 of them as positive sites, with a Pr of 80.74%. However, if the real positive sites were very few in the full data set, or the occurrence of real positive sites was even lower than randomly generated data, the Pr value could be very small and even lower than zero. In our analysis, there were 53 PK groups (25%% of 212 PK groups) with low performances. And the prediction results of these predictors were removed in this work. Totally, there were 12,219 sites predicted with at least one PK, with a total coverage of 92.19%. The total prediction results are available from: http://gps.biocuckoo.org/download/Large-scale_Prediction.txt.gz

Page 25: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

22

Existed resources and online services for

phosphorylation

Predictors Web Link Ref. Phosphorylation Databases Phospho.ELM http://phospho.elm.eu.org/ 13 PhosphoSite http://www.phosphosite.org/ 14 Online Services and tools <1> For General phosphorylation sites prediction DISPHOS http://core.ist.temple.edu/pred/ 15 NetPhos 2.0 http://www.cbs.dtu.dk/services/NetPhos/ 16 NetPhosYeast http://www.cbs.dtu.dk/services/NetPhosYeast/ 17 GANNPhos not available 18 <2> For kinase-specific sites prediction GPS 1.10 http://bioinformatics.lcd-ustc.org/gps_web 2,3 PPSP http://bioinformatics.lcd-ustc.org/PPSP/ 4 NetPhosK 1.0 http://www.cbs.dtu.dk/services/NetPhosK/ 5 ScanSite http://scansite.mit.edu/ 6 KinasePhos 1.0 http://kinasephos.mbc.nctu.edu.tw/ 7 KinasePhos 2.0 http://kinasephos2.mbc.nctu.edu.tw/ 12 PredPhospho http://pred.ngri.re.kr/PredPhospho.htm 8 PREDIKIN http://florey.biosci.uq.edu.au/kinsub/home.htm 9 PhoScan http://bioinfo.au.tsinghua.edu.cn/phoscan/ 10 pkaPS http://mendel.imp.ac.at/sat/pkaPS/ 11 <3> For construction of phosphorylation network NetworKIN http://networkin.info/ 19

Page 26: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

23

References

1. G. Manning, D. B. Whyte, R. Martinez et al., Science 298 (5600), 1912 (2002). 2. Y. Xue, F. Zhou, M. Zhu et al., Nucleic Acids Res 33 (Web Server issue), W184 (2005). 3. F. F. Zhou, Y. Xue, G. L. Chen et al., Biochem Biophys Res Commun 325 (4), 1443 (2004). 4. Y. Xue, A. Li, L. Wang et al., BMC Bioinformatics 7, 163 (2006). 5. N. Blom, T. Sicheritz-Ponten, R. Gupta et al., Proteomics 4 (6), 1633 (2004). 6. J. C. Obenauer, L. C. Cantley, and M. B. Yaffe, Nucleic Acids Res 31 (13), 3635 (2003). 7. H. D. Huang, T. Y. Lee, S. W. Tzeng et al., Nucleic Acids Res 33 (Web Server issue), W226 (2005). 8. J. H. Kim, J. Lee, B. Oh et al., Bioinformatics 20 (17), 3179 (2004). 9. R. I. Brinkworth, R. A. Breinl, and B. Kobe, Proc Natl Acad Sci U S A 100 (1), 74 (2003). 10. T. Li, F. Li, and X. Zhang, Proteins (2007). 11. G. Neuberger, G. Schneider, and F. Eisenhaber, Biology direct 2, 1 (2007). 12. Y. H. Wong, T. Y. Lee, H. K. Liang et al., Nucleic Acids Res 35 (Web Server issue), W588 (2007). 13. F. Diella, S. Cameron, C. Gemund et al., BMC Bioinformatics 5 (1), 79 (2004). 14. P. V. Hornbeck, I. Chabra, J. M. Kornhauser et al., Proteomics 4 (6), 1551 (2004). 15. L. M. Iakoucheva, P. Radivojac, C. J. Brown et al., Nucleic Acids Res 32 (3), 1037 (2004). 16. N. Blom, S. Gammeltoft, and S. Brunak, J Mol Biol 294 (5), 1351 (1999). 17. C. R. Ingrell, M. L. Miller, O. N. Jensen et al., Bioinformatics 23 (7), 895 (2007). 18. Y. R. Tang, Y. Z. Chen, C. A. Canchaya et al., Protein Eng Des Sel 20 (8), 405 (2007). 19. R. Linding, L. J. Jensen, G. J. Ostheimer et al., Cell 129 (7), 1415 (2007).

Page 27: GPS Manual - biocuckoogps.biocuckoo.org/download/GPS Manual.pdf · GPS. Previously, we used the GPS to denote our Groupbased Phosphorylation - Scoring algorithm. Currently, we are

GPS Manual

24

Release Note

1. Jan. 1st, 2008, the online service and the local stand-alone packages of GPS 2.0 were released. The stand-alone software of GPS 2.0 could support Windows Operating Systems.

2. Jan. 29th, 2008, a bug was found that the version 2.0 couldn’t be used under non-English Operating Systems. We fixed the bug and released the version 2.0.1 beta version.

3. Apr. 13th, 2008, The GPS 2.0.1 was released, with online service and local packages, to support three major Operating Systems, including Windows, Linux/Unix and Mac. Also, the GPS 2.0.1 manual was updated and included in the packages.

4. Jun. 13th, 2008, we fixed a display problem under Mac OS X and released the GPS 2.0.2.

5. Aug. 28th, 2008, DOG (Domain Graph) 1.0 was integrated into GPS 2.0.3 and a new function of visualizing the predicted sites was added.

6. Dec. 25th, 2008, GPS version 2.1 was released. The GPS 2.1 used a Peptide Selection method to improve the prediction performances and robustness greatly. The Acknowledgement list, contact information and citation were added into the new version.

7. Jan. 19th, 2009,GPS web server was moved to new website (http://gps.biocuckoo.org) and a new GPS logo was put into use.

8. Jul. 21th, 2009,GPS version 2.1.1 was released. Check for update function was added. DOG (Domain Graph) was updated to version 1.0.5.

9. Sep. 12th, 2012, GPS 2.1.2 was released. The output format was changed from FASTA to TAB. DOG (Domain Graph) was updated to version 2.0.