Bioinformatic Analysis of Protein Families Daniil G. Naumoff Laboratory of Bioinformatics State Institute for Genetics and Selection of Industrial Microorganisms Moscow, Russia Gos NII Genetika Moscow, Russia


Bioinformatic Analysis of Protein Families

Daniil G. Naumoff

Laboratory of Bioinformatics
State Institute for Genetics and Selection of Industrial Microorganisms

Moscow, Russia

Gos NII Genetika

Moscow, Russia


The International Nucleotide Sequence Database Collaboration (INSDC)

• GenBank at NCBI: http://www.ncbi.nlm.nih.gov/Genbank/

• EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk/embl/

• DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/

Corresponding protein databases: GenPept, UniProtKB/TrEMBL, and DDBJ

Curated protein database Swiss-Prot: http://au.expasy.org/sprot/

Three-dimensional (3D) structures of proteins

PDB: http://www.pdb.org/pdb/home/home.do (database)

SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/ (classification)


http://www.ebi.ac.uk/embl/Services/DBStats/

http://www.genomesonline.org/gold_statistics.htm


http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100


Search for homologues


BLOSUM-62 matrix

http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
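
For orientation, the entries of a BLOSUM matrix are log-odds scores. The formula below is a sketch of the standard definition for BLOSUM-62, which is scaled in half-bit units. The symbols are notation introduced here, not taken from the slide: p_ij is the observed frequency of the aligned residue pair (i, j) in blocks clustered at 62% identity, and q_i, q_j are the background frequencies of the two residues.

```latex
s_{ij} = \mathrm{round}\!\left( 2 \log_2 \frac{p_{ij}}{q_i \, q_j} \right)
```

A positive score means the pair is observed more often than expected by chance, and a negative score means it is observed less often.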


Overprediction is annotation of sequences at a greater level of functional specificity than available evidence supports.


A Protein Family Analysis (http://zbio.net/bio/001/003.html)

- Select a protein
- Determine the domain structure of the selected protein
- Select a domain to be analyzed
- Has the protein domain family been annotated in a database?
- Update the family list or search for homologous domains
- Check each "atypical" sequence (it will probably be edited or removed)
- Preliminary division into subfamilies
- Multiple sequence alignment (consensus?)
- Phylogenetic analysis
- Phylogenetic tree visualization
- Subfamily structure
- Interfamily relationships (superfamilies, clans, etc.)
- 2D and 3D analysis (prediction)


Number of annotated protein domain families

Database       Date              Number of families  Address
HOMSTRAD       4 Jan 2010        1032                http://www-cryst.bioc.cam.ac.uk/homstrad/
CATH 3.3       4 Jan 2010        10019               http://www.cathdb.info/
SCOP 1.75      June 2009         3902                http://scop.mrc-lmb.cam.ac.uk/scop/
COG                              4872                http://www.ncbi.nlm.nih.gov/COG/grace/uni.html
KOG                              4852                http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi
Pfam 24.0      October 2009      11912               http://pfam.janelia.org/
PUMA2                            8575                http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi
InterPro 24.0  16 December 2009  11082               http://www.ebi.ac.uk/interpro/
ADDA                             15333               http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb


51,778 domain families (+ 158,798 singletons) according to Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res. 2006, 34(3):1066-1080.

13,511 SMOG domains according to Sadreyev & Grishin (BMC Struct Biol, 2006)


ADDA - Automatic Domain Decomposition Algorithm
http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb/form_browse

33,879 domain families (79,965 if redundant sequences were used) according to Heger A, Holm L. Exhaustive enumeration of protein domain families. J Mol Biol. 2003, 328(3):749-767.



Let’s use this protein as a query sequence for BLAST


BLAST results (Descriptions)

E-value < 0.01 or 0.001
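
The E-value cutoff above can be applied programmatically when the BLAST results are available as tab-separated text. The following is a minimal sketch, assuming the results were exported in the default 12-column tabular layout of BLAST+ (`-outfmt 6`); the function name `significant_hits` and the three example rows are hypothetical and are not taken from the slides.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def significant_hits(lines, max_evalue=0.001):
    """Return (subject id, E-value) pairs whose E-value is below max_evalue."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        evalue = float(row["evalue"])
        if evalue < max_evalue:
            hits.append((row["sseqid"], evalue))
    return hits


# Hypothetical rows, for illustration only.
example = [
    "query1\thitA\t45.2\t300\t150\t5\t1\t300\t10\t305\t2e-50\t180",
    "query1\thitB\t28.0\t120\t80\t3\t50\t170\t20\t138\t0.005\t35.1",
    "query1\thitC\t25.0\t60\t40\t2\t10\t70\t5\t63\t1.2\t25.0",
]

print(significant_hits(example, max_evalue=0.01))
print(significant_hits(example, max_evalue=0.001))
```

With the looser cutoff (0.01) both hitA and hitB pass; with the stricter one (0.001) only hitA remains, which illustrates how the choice between the two thresholds changes the candidate list.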


BLAST results (Graphic overview)

Domain I Domain II Domain III


Domain structure of proteins of the GH27 family, according to Naumoff D.G. Phylogenetic analysis of α-galactosidases of the GH27 family. Molecular Biology (Engl Transl), 2004, 38(3):388-399.
PDF: http://bioinform.genetika.ru/members/Naumoff/MB2004E.pdf

Domain architectures:

- GH27N GH27C
- GH27N
- GH27N GH27C CBM13
- GH27N GH27C CBM6
- GH27N GH27C CBM6 CBM13
- GH27N CBM13 GH27C
- NEW1 GH27N CBM13 GH27C
- NEW1 GH27N GH27C
- NEW2 NEW1 GH27N GH27C
- GH27N GH27C NEW3 NEW2
- GH27N GH27C NEW3
- GH27N GH27C Dockerin
- GH27N GH27C CBM1 CE1

Legend:

- GH27N: N-terminal domain of GH27 family
- GH27C: C-terminal domain of GH27 family
- CE1: CE1 domain of carbohydrate esterases
- CBM1: carbohydrate-binding module CBM1
- CBM6: carbohydrate-binding module CBM6
- CBM13: carbohydrate-binding module CBM13
- Dockerin: dockerin I domain
- NEW1, NEW2, NEW3: uncharacterized domains (one of them is labeled NPCBM)
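
The domain architectures listed above lend themselves to a simple tabulation. The sketch below stores each architecture as a tuple of domain names and counts how many architectures contain each domain. It assumes that the names on the slide read from the N- to the C-terminus; the helper names `domain_usage` and `with_domain` are illustrative only.

```python
from collections import Counter

# Domain architectures as listed on the slide
# (assumed to read from N- to C-terminus).
ARCHITECTURES = [
    ("GH27N", "GH27C"),
    ("GH27N",),
    ("GH27N", "GH27C", "CBM13"),
    ("GH27N", "GH27C", "CBM6"),
    ("GH27N", "GH27C", "CBM6", "CBM13"),
    ("GH27N", "CBM13", "GH27C"),
    ("NEW1", "GH27N", "CBM13", "GH27C"),
    ("NEW1", "GH27N", "GH27C"),
    ("NEW2", "NEW1", "GH27N", "GH27C"),
    ("GH27N", "GH27C", "NEW3", "NEW2"),
    ("GH27N", "GH27C", "NEW3"),
    ("GH27N", "GH27C", "Dockerin"),
    ("GH27N", "GH27C", "CBM1", "CE1"),
]


def domain_usage(architectures):
    """Count in how many architectures each domain occurs."""
    counts = Counter()
    for arch in architectures:
        counts.update(set(arch))  # count a domain once per architecture
    return counts


def with_domain(architectures, prefix):
    """Architectures containing at least one domain whose name starts with prefix."""
    return [a for a in architectures if any(d.startswith(prefix) for d in a)]


usage = domain_usage(ARCHITECTURES)
for domain, n in usage.most_common():
    print(domain, n)
print(len(with_domain(ARCHITECTURES, "CBM")), "architectures carry a CBM")
```

Keeping the architectures as ordered tuples preserves domain order, so the same data can later be used to compare, for example, GH27N CBM13 GH27C with GH27N GH27C CBM13.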


Universal Protein Domain Databases: ADDA, InterPro 24.0, PUMA2, Pfam 24.0, KOG, COG, SCOP 1.75, CATH 3.3, HOMSTRAD (addresses, dates, and family counts as listed under "Number of annotated protein domain families").


Databases of individual protein families (http://www.oxfordjournals.org/nar/database/subcat/3/10)


Sequence Based Classification of the Carbohydrate-Active Enzymes at the CAZy server (www.cazy.org/)

• Glycoside Hydrolases (including transglycosidases) => 118 GH families (14 clans)

• Glycosyltransferases => 92 GT families

• Polysaccharide Lyases => 21 PL families

• Carbohydrate Esterases => 16 CE families

• Carbohydrate-Binding Modules => 59 CBM families


Family GH72 of Glycoside Hydrolases (http://www.cazy.org/GH72.html)


Multiple Sequence Alignment:

– Automatic (ClustalW or ClustalX)
   • >50% sequence identity
   • only one domain
   • no protein fragments

– Manual (BioEdit) (take into account the BLAST pairwise sequence alignment!)
   • <30% sequence identity
   • long insertions / deletions
   • facultative N-terminal part

Local dissimilarities of very similar sequences:

– Local frameshift
– Exon-intron structure
– Stop codon
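
The identity thresholds above can be checked on a pair of already aligned sequences. The sketch below computes percent identity over the columns in which both sequences have a residue; this is only one of several definitions in use, and it is an assumption here. The function names, the toy sequences, and the label for the 30-50% range (which the slide does not cover) are likewise illustrative.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def alignment_strategy(identity):
    """Map percent identity to the approach suggested on the slide."""
    if identity > 50:
        return "automatic"
    if identity < 30:
        return "manual"
    # The slide gives no rule for 30-50%; this label is an assumption.
    return "automatic, then manual check"


# Toy aligned sequences, for illustration only.
seq_a = "MKT-LLV"
seq_b = "MKSALLV"
identity = percent_identity(seq_a, seq_b)
print(round(identity, 1), alignment_strategy(identity))
```

In practice the other two conditions for automatic alignment (a single domain, no fragments) still have to be checked separately, since identity alone does not capture them.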


BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html)


Phylip (http://evolution.gs.washington.edu/phylip.html)

Maximum Parsimony (ProtPars)

Distance program (Neighbor-Joining)


An infile for the Phylip package programs
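
As a rough illustration of what such an infile contains, the sketch below writes an alignment in PHYLIP format: a header line with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters. Because each sequence is written on a single line, the text is valid both as sequential and as interleaved input. The function name `phylip_infile` and the toy sequences are hypothetical.

```python
def phylip_infile(alignment):
    """Format {name: aligned sequence} as PHYLIP text (one line per sequence)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Header: number of sequences and number of alignment columns.
    lines = [f"{len(alignment):>5} {lengths.pop():>5}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # PHYLIP reads the first 10 characters of each line as the name.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


# Toy alignment, for illustration only.
example = {
    "97A1_BACTH": "MKT-LLV",
    "97B1_BACTH": "MKSALLV",
    "97C1_BACTH": "MRSA-LV",
}

with open("infile", "w") as handle:
    handle.write(phylip_infile(example))
```

The fixed 10-character name field is the main practical constraint: longer identifiers have to be shortened before the file is passed to programs of the package.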


Maximum Parsimony (protpars.exe) from the Phylip package


Phylogenetic tree visualization: TreeView program (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)

Tree styles: slanted cladogram, radial, rectangular cladogram, phylogram


Subfamily criteria (for glycosidases)

1. Pairwise sequence similarity (>30% of identity)

2. Order of sequence appearance during BLAST search (members of the same subfamily always appear at the top of BLAST results)

3. Monophyletic status
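
The third criterion can be tested mechanically once a rooted tree is available. The sketch below represents a tree as nested tuples and reports a group as monophyletic when some clade contains exactly the group's leaves. The tree, the leaf names, and the function names are hypothetical, and the result depends on where the tree is rooted.

```python
def leaves(tree):
    """Set of leaf names in a rooted tree written as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the tree contains exactly the given leaves."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Toy rooted tree, for illustration only.
toy_tree = ((("a1", "a2"), "a3"), (("b1", "b2"), ("c1", "c2")))

print(is_monophyletic(toy_tree, {"a1", "a2", "a3"}))  # one clade holds exactly these
print(is_monophyletic(toy_tree, {"b1", "b2", "c1"}))  # c2 would also be included
```

A candidate subfamily that fails this test either needs to be split or needs to absorb the sequences that fall inside its smallest enclosing clade.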


The maximum parsimony phylogenetic tree of family GH97

[Figure: phylogenetic tree with numeric support values at the nodes (up to 1000). Clades are labeled Subfamily 97a, 97b, 97c, 97d, and 97e; α-glucosidase activity [EC 3.2.1.20] is marked on the tree.]

Sequences in the tree, grouped by name prefix:
- 97A: 97A1_HALMA, 97A1_SALRU, 97A2_BACFR, 97A3_BACTH, 97A1_PRERU, 97A1_PREIN, 97A1_BACTH, 97A1_TANFO, 97A1_BACFR, 97A2_BACTH, 97A1_UNBAC, 97A8_ENSEQ, 97A1_AZOVI, 97A5_ENSEQ, 97A4_ENSEQ, 97A3_ENSEQ, 97A7_ENSEQ, 97A6_ENSEQ, 97A1_MICDE, 97A1_SHEON, 97A2_ENSEQ, 97A1_ENSEQ, 97A1_NOVAR, 97A1_ERYLI, 97A1_XANAX, 97A1_XANCA
- 97B: 97B1_MICDE, 97B4_BACTH, 97B1_PRERU, 97B1_BACTH, 97B2_PRERU, 97B1_BACFR, 97B3_BACTH, 97B2_BACFR, 97B2_BACTH
- 97C: 97C1_LEIXY, 97C1_PRERU, 97C2_BACTH, 97C1_MICDE, 97C2_MICDE, 97C1_BACTH, 97C2_PRERU, 97C3_PRERU
- 97D: 97D1_CAUCR, 97D1_XANAX, 97D1_XANCA
- 97E: 97E1_BACTH, 97E1_RHOBA


The neighbor-joining phylogenetic tree of family GH97

[Figure: phylogenetic tree with numeric support values at the nodes (up to 996). Clades are labeled Subfamily 97e, 97c, 97d, 97b, and 97a; [EC 3.2.1.20] is marked on the tree.]

Sequences in the tree, in the order shown:
- 97E: 97E1_RHOBA, 97E1_BACTH
- 97C: 97C1_LEIXY, 97C1_PRERU, 97C2_BACTH, 97C1_MICDE, 97C2_MICDE, 97C1_BACTH, 97C2_PRERU, 97C3_PRERU
- 97D: 97D1_CAUCR, 97D1_XANCA, 97D1_XANAX
- 97B: 97B1_MICDE, 97B1_BACTH, 97B4_BACTH, 97B1_PRERU, 97B2_PRERU, 97B1_BACFR, 97B3_BACTH, 97B2_BACFR, 97B2_BACTH
- 97A: 97A1_HALMA, 97A1_PRERU, 97A1_PREIN, 97A1_TANFO, 97A1_BACTH, 97A1_BACFR, 97A1_UNBAC, 97A2_BACTH, 97A1_SALRU, 97A2_BACFR, 97A3_BACTH, 97A1_AZOVI, 97A8_ENSEQ, 97A5_ENSEQ, 97A4_ENSEQ, 97A3_ENSEQ, 97A7_ENSEQ, 97A6_ENSEQ, 97A1_ERYLI, 97A1_NOVAR, 97A1_XANCA, 97A1_XANAX, 97A1_MICDE, 97A1_SHEON, 97A2_ENSEQ, 97A1_ENSEQ


The neighbor-joining phylogenetic tree of the α-galactosidase superfamily

[Figure: phylogenetic tree with numeric support values at the nodes. Clades are labeled GH27, GH31, GH36A, GH36B, GH36C, and GH36D; leaves are labeled with protein and organism codes (for example AGAL HUMAN, NAGA HUMAN, MEL1 YEAST, RAFA ECOLI).]


Families of the α-galactosidase superfamily and family GH97

[Table. Columns: GH27, GH31, GH36A, GH36B, GH36C, GH36D, GH97. Rows: Clan, COG/KOG, Known enzymatic activities, Molecular mechanism, Origin. Entries appearing in each row:]

Clan: GH-D (GH27, GH31, GH36A, GH36B, GH36C, GH36D); None (GH97)

COG/KOG: COG1501, KOG1065, COG3345, KOG2366, None

Known enzymatic activities: EC 2.4.1.x, EC 2.4.1.67, EC 2.4.1.82, EC 3.2.1.10, EC 3.2.1.20, EC 3.2.1.22, EC 3.2.1.48, EC 3.2.1.49, EC 3.2.1.84, EC 3.2.1.88, EC 3.2.1.94, EC 4.2.2.13

Molecular mechanism: Retaining; Retaining, Inverting; Not known

Origin:
- Eukaryota: Alveolata, Entamoebidae, Euglenozoa, Fungi, Metazoa (once marked "?"), Mycetozoa, Rhodophyta, Viridiplantae
- Eubacteria: Acidobacteria, Actinobacteria, Bacteroidetes, Cyanobacteria, Deinococcus, Fibrobacteres, Firmicutes, Planctomycetes, Proteobacteria, Spirochaetes, Thermotogales, Thermus, Verrucomicrobia
- Archaea: Crenarchaeota, Euryarchaeota


Clans of Glycoside Hydrolases

Clan   Tertiary Structure   Optical Configuration                Families (GH)
GH-A   (β/α)8-barrel        retention (equatorial orientation)   1, 2, 5, 10, 17, 26, 30, 35, 39, 42, 50, 51, 53, 59, 72, 79, 86, 113
GH-B   β-jelly roll         retention (equatorial orientation)   7, 16
GH-C   β-jelly roll         retention (equatorial orientation)   11, 12
GH-D   (β/α)8-barrel        retention (axial orientation)        27, 31, 36
GH-E   6-fold β-propeller   retention (equatorial orientation)   33, 34, 83, 93
GH-F   5-fold β-propeller   inversion (equatorial orientation)   43, 62
GH-G   (α/α)6               inversion (axial orientation)        37, 63
GH-H   (β/α)8-barrel        retention (axial orientation)        13, 70, 77
GH-I   α+β                  inversion (equatorial orientation)   24, 46, 80
GH-J   5-fold β-propeller   retention (β-furanoside)             32, 68
GH-K   (β/α)8-barrel        retention (equatorial orientation)   18, 20, 85
GH-L   (α/α)6               inversion (axial orientation)        15, 65
GH-M   (α/α)6               inversion (equatorial orientation)   8, 48
GH-N   (β)3-solenoid        inversion (axial orientation)        28, 49


Rigden DJ. Iterative database searches demonstrate that glycoside hydrolase families 27, 31, 36, and 66 share a common evolutionary origin with family 13. FEBS Lett. 2002, 523(1-3):17‑22.

clans: GH-D, GH-H


Nagano N, Porter CT, Thornton JM. The (β/α)8 glycosidases: sequence and structure analyses suggest distant evolutionary relationships. Protein Eng. 2001, 14(11):845-855.

clans: GH-H GH-A GH-K ?


Screenshot of PSI Protein Classifier

D.G. Naumoff and M. Carreras. 2009. PSI Protein Classifier: a new program automating PSI-BLAST search results. Molecular Biology (Engl Transl). V.43. N.4. P.652-664.


A hierarchical classification of the (β/α)8-type glycosyl hydrolases


A hierarchical structure of the β-fructosidase (furanosidase) superfamily

furanosidase superfamily
  clan GH-J
    GH32: GH32a, GH32b, GH32c, GH32d
    GH68: GH68a, GH68b
  clan GH-F
    GH43: GH43a, GH43b, GH43c, GH43d, GH43e, GH43f, GH43g
    GH62
  GHLP
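
The hierarchy above can be held in a small nested structure for lookups. The sketch below is one possible encoding: the assignment of families to clans follows the clan table (GH-J: 32, 68; GH-F: 43, 62), while the "no clan" key for GHLP and the function name `classify` are assumptions made for illustration.

```python
# Superfamily -> clan -> family -> subfamilies, as on the slide.
SUPERFAMILY = "furanosidase superfamily"
HIERARCHY = {
    "clan GH-J": {
        "GH32": ["GH32a", "GH32b", "GH32c", "GH32d"],
        "GH68": ["GH68a", "GH68b"],
    },
    "clan GH-F": {
        "GH43": ["GH43a", "GH43b", "GH43c", "GH43d", "GH43e", "GH43f", "GH43g"],
        "GH62": [],
    },
    # GHLP is shown outside both clans; the key name is an assumption.
    "no clan": {"GHLP": []},
}


def classify(name):
    """Return (superfamily, clan, family) for a family or subfamily name."""
    for clan, families in HIERARCHY.items():
        for family, subfamilies in families.items():
            if name == family or name in subfamilies:
                return (SUPERFAMILY, clan, family)
    return None


print(classify("GH43e"))
print(classify("GH68"))
print(classify("GH13"))
```

Names outside the superfamily, such as GH13, return None rather than raising an error, which makes the function convenient for filtering mixed lists of families.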


The Secondary Structure Prediction

– 3D-PSSM (http://www.sbg.bio.ic.ac.uk/~3dpssm/)
– GOR IV (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html)
– nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html)
– PredictProtein (http://www.embl-heidelberg.de/predictprotein/predictprotein.html)
– Hydrophobic cluster analysis (HCA)

The Tertiary Structure Prediction

– The SWISS-MODEL modeling server (http://swissmodel.expasy.org/)


Phylogenetic Analysis of a Protein Family

– The first stage of a study
   • Prediction of the 3D structure and domain structure of the protein
   • Prediction of the active center and of residues for site-directed mutagenesis
   • Prediction of the enzymatic activities
– The only part of a study (bioinformatics)
– The final stage of a study (interpretation of the experimental results)

Comparing the phylogenetic trees of the individual domains of a protein makes it possible to reveal the protein's evolutionary history, viz. the roles of gene duplication, loss, fusion, and horizontal transfer.