Prediction of Protein Solubility

Motivation & objective

• protein solubility is a major bottleneck in production of many therapeutic and industrially attractive proteins
• attempts at experimental solubilization are often unsuccessful and expensive
• objective: reduce costs of proteomic studies by computational prioritization of protein sequences

Hon J.¹,²,³, Marušiak M.³, Martínek T.³, Zendulka J.³, Bednář D.¹,², Damborský J.¹,²

¹ Loschmidt Laboratories, Centre for Toxic Compounds in the Environment RECETOX and Department of Experimental Biology, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
² International Clinical Research Center, St. Anne's University Hospital Brno, 656 91 Brno, Czech Republic
³ IT4Innovations Centre of Excellence, Faculty of Information Technology, Brno University of Technology, 612 66 Brno, Czech Republic

SoluProt predictor

• 36 sequence-based features: mostly amino acid content, plus predicted disorder, alpha-helix and beta-sheet content, sequence identity to PDB and several aggregated physico-chemical properties
• random forest regression model
• best accuracy of all compared predictors (58.2% on the test set; see Table 1)
• useful for protein prioritization
• implementation in Python available upon request (work in progress)
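The amino acid content features mentioned above can be illustrated with a short sketch. This is not the SoluProt implementation: the function name `amino_acid_content` is hypothetical, and the remaining features (predicted disorder, secondary structure, PDB identity, physico-chemical aggregates) and the random forest model are not reproduced here.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_content(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Hypothetical helper for illustration only; SoluProt's own feature
    extraction is not described in detail on the poster.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


# Example: a toy six-residue sequence.
features = amino_acid_content("MKKLLA")
```

A vector of 20 such fractions per sequence is the kind of input that could be passed, together with other features, to a regression model.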

Outlook

• novel features based on predicted tertiary structure
• implementation of an easy-to-use webserver
• experimental validation using a set of 40 putative haloalkane dehalogenases

Figure 2. ROC curves of SoluProt and of other sequence-based tools measured on the test set. Axes: 1 − specificity (x) vs. sensitivity (y), both from 0.0 to 1.0. Tools plotted: BioTech, ccSOL omics, DeepSol, ESPRESSO, mWH, PROSOII, Protein-Sol, SOLpro, SoluProt.
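The area under a ROC curve such as those in Figure 2 can be computed directly from scores and labels. The sketch below uses the rank-based interpretation of AUC (the probability that a randomly chosen soluble sequence receives a higher score than a randomly chosen insoluble one); the function name `roc_auc` and the toy data are illustrative assumptions, not values from the poster.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (soluble, insoluble) pairs ranked correctly.

    labels: 1 for soluble, 0 for insoluble; scores: predicted solubility.
    Ties between a soluble and an insoluble score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): six sequences with predicted scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of sequences, which is acceptable for a test set of a few thousand proteins; a sort-based variant would scale better.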

Figure 1. Increase of soluble sequences when using solubility prediction tools for prioritization on the SoluProt test set. Removing the 90% worst sequences from the test set (using SoluProt prediction) increases the number of soluble outcomes by 49.9% in comparison to blind selection. Axes: worst sequences removed (x, 0% to 100%) vs. success increase (y, 0% to 100%). Tools plotted: BioTech, ccSOL omics, DeepSol, ESPRESSO, mWH, PROSOII, Protein-Sol, SOLpro, SoluProt.
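The prioritization measure in Figure 1 can be sketched as follows. The poster does not give a formula, so this assumes that "success increase" is the relative gain in the soluble fraction among the retained top-scoring sequences compared with the soluble fraction of the whole set (the expected outcome of blind selection). The function name `success_increase` and the toy data are hypothetical.

```python
def success_increase(labels, scores, removed_fraction=0.9):
    """Relative gain in soluble fraction after discarding the worst-scored sequences.

    Assumed definition (not stated on the poster):
        soluble fraction among kept sequences / soluble fraction overall - 1
    """
    n = len(labels)
    n_keep = n - int(round(n * removed_fraction))
    if n_keep <= 0:
        raise ValueError("removed_fraction leaves no sequences")
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("no soluble sequences in the set")
    # Rank by predicted score, best first, and keep the top n_keep.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    kept_labels = [y for _, y in ranked[:n_keep]]
    kept_rate = sum(kept_labels) / n_keep
    return kept_rate / base_rate - 1.0


# Toy data (hypothetical): ten sequences, five of them soluble.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
increase = success_increase(labels, scores, removed_fraction=0.6)
```

Under this definition a value of 0 means the predictor does no better than blind selection, and a negative value means it does worse.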

References & acknowledgements

1. Helen M. Berman, M. J. G., & Protein Structure Initiative Network of Investigators. (2017). Protein Structure Initiative – Targettrack 2000–2017 – All Data Files [Data set]. Zenodo. DOI: 10.5281/zenodo.821654
2. Price, W. N., Handelman, S. K., Everett, J. K., Tong, S. N., Bracic, A., Luff, J. D., Hunt, J. F. (2011). Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli. Microb Inform and Exp, 1(1), 6. DOI: 10.1186/2042-5783-1-6

SoluProt training set

• TargetTrack database [1]: more than 300,000 records from the structural genomics projects
• Keep sequences expressed in E. coli only: matching algorithm based on keyword scoring and manual checking of protocol descriptions
• Determine solubility: algorithm based on trial status and trial stop status
• Remove short sequences and sequences with undefined residues or transmembrane regions
• Apply PDB correction: discard insoluble sequences found in the current version of the PDB database
• Split by the solubility
• Cluster to 25% identity (soluble and insoluble subsets separately)
• Join into a single dataset
• Balance the number of negative and positive samples and equalize their sequence length distributions
• Training set: 10,912 protein sequences
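Two of the workflow steps above, removing short or ambiguous sequences and balancing the classes, can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: the length cutoff `MIN_LENGTH` is a hypothetical value (the poster does not state one), transmembrane filtering and clustering are omitted, and the balancing shown here only equalizes class counts, not sequence length distributions.

```python
import random

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 20  # hypothetical cutoff; the poster does not give the value


def is_usable(sequence):
    """True if the sequence is long enough and has only standard residues."""
    return len(sequence) >= MIN_LENGTH and set(sequence.upper()) <= STANDARD_RESIDUES


def balance(soluble, insoluble, seed=0):
    """Downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(soluble), len(insoluble))
    return rng.sample(soluble, n), rng.sample(insoluble, n)


# Toy records (hypothetical): (sequence, is_soluble).
records = [
    ("MKT" * 10, True),    # kept
    ("MKX" * 10, True),    # dropped: undefined residue X
    ("MK", False),         # dropped: too short
    ("ACDE" * 6, False),   # kept
    ("GHIK" * 8, True),    # kept
]
usable = [(seq, sol) for seq, sol in records if is_usable(seq)]
soluble = [seq for seq, sol in usable if sol]
insoluble = [seq for seq, sol in usable if not sol]
soluble_bal, insoluble_bal = balance(soluble, insoluble)
```

Equalizing length distributions would additionally require sampling within length bins, which is left out of this sketch.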

SoluProt test set

• based on NESG [2] dataset: a set of 9,703 proteins expressed in E. coli
• processed with a similar workflow as the training set, only without the first two steps
• 3,788 sequences in the fully independent test set after processing

Table 1. Performance of sequence-based solubility prediction tools on the SoluProt test set. AUC, the best possible accuracy, and the success increase when the 90% worst sequences are removed from the set are presented.

Tool          AUC    Accuracy   Succ. inc.
SoluProt      0.61   58.2%      49.9%
PROSOII       0.60   57.2%      43.0%
ESPRESSO*     0.57   55.5%      23.5%
DeepSol       0.55   54.1%      30.3%
Protein-Sol   0.54   53.6%      20.8%
mWH           0.54   53.9%      13.5%
SOLpro        0.52   52.3%      2.4%
ccSOL omics   0.51   51.7%      -1%
BioTech       0.50   50.4%      -3.4%

*Property-based ESPRESSO prediction.
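The "best possible accuracy" column can be read as the accuracy at the most favourable decision threshold for each tool. The sketch below implements that reading as an assumption, since the poster does not spell out the procedure; the function name `best_accuracy` and the toy data are hypothetical.

```python
def best_accuracy(labels, scores):
    """Maximum accuracy over all decision thresholds.

    A sequence is predicted soluble when its score is >= the threshold.
    labels: 1 for soluble, 0 for insoluble.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("empty input")
    # Every observed score is a candidate threshold; infinity covers
    # the case where every sequence is predicted insoluble.
    thresholds = sorted(set(scores)) + [float("inf")]
    best = 0.0
    for t in thresholds:
        correct = sum((s >= t) == (y == 1) for y, s in zip(labels, scores))
        best = max(best, correct / n)
    return best


# Toy data (hypothetical): six sequences with predicted scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
accuracy = best_accuracy(labels, scores)
```

Because the threshold is chosen on the same data that is scored, this value is an upper bound on the accuracy achievable with a fixed, pre-selected threshold.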

This work was supported by ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047) including access to computing and storage facilities. Computational resources were supplied by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures”.


Conference ELIXIR CZ 2017 Třešť, 15 – 16 Nov, 2017
