Structural Bioinformatics: Databases and Analyses

Osaka University
1. Protein Data Bank Japan (PDBj) at Osaka
2. Functional prediction from 3D structures
2-1: Similar backbone structure search
2-2: Similar local functional site search
2-3: Similar molecular surface search
3. Structural modelling of protein-protein interactions: surFit, a docking server for protein molecular surfaces
4. Tutorials for PDBj
Outline of the Talk
http://www.wwpdb.org/
E-MSD is supported by grants from the Wellcome Trust, the EU (TEMBLOR, NMRQUAL and IIMS), CCP4, the BBSRC, the MRC and EMBL.
The BMRB is supported by NIH grant LM05799 from the National Library of Medicine.
PDBj is supported by grant-in-aid from the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST), and the Ministry of Education, Culture, Sports, Science and Technology (MEXT).
The RCSB PDB is supported by grants from the National Science Foundation, National Institute of General Medical Sciences, the Office of Science-Department of Energy, the National Library of Medicine, the National Cancer Institute, the National Center for Research Resources, the National Institute of Biomedical Imaging and Bioengineering, the National Institute of Neurological Disorders and Stroke, and the National Institute of Diabetes & Digestive & Kidney Diseases.
Kleywegt, G Berman, HM Markley, JL Nakamura, H
wwPDB and wwPDBAC members at IPR, Osaka on 6 Nov, 2009
International collaboration in wwPDB
Rutgers Univ.
UCSD NIST
PDBj EBI
RCSB
1) Curation, data processing, and registration are carried out by all the members in collaboration with each other.
2) We have a single data archive, which is looked after by one “archive keeper (RCSB)”.
3) Data format and new descriptions are discussed among the members.
4) Members are encouraged to develop their own browsers, viewers, and other APIs and services.
wwPDB FTP Traffic
33,931,967 files were downloaded from three wwPDB members (PDBj, RCSB-PDB, PDBe) during Dec. 2009
Protein Data Bank Japan
http://www.pdbj.org/
Based at the Institute for Protein Research, Osaka Univ., since 2001, supported by the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST).
Protein Data Bank Japan PDBj Main activities
1. The wwPDB management as one of the members
2. Curation and data processing for the deposited data (PDB and BMRB)
3. Data remediation and development of the format for correct description
4. Addition of experimental information from the literature
5. Development of query tools and derived databases as the web service
Japanese Korean
Members
Head: Haruki Nakamura
Group for Primary Annotation: Atsushi Nakagawa, Takanori Matsuura, Reiko Igarashi, Yumiko Kengaku, Kanna Matsuura, Chen Minyu
Group for Development of Tools and Services: Akira R. Kinjo, Kenji Iwasaki, Hirofumi Suzuki, Reiko Yamashita, Takahiro Kudou, Yukiko Shimizu, Yasuyo Ikegawa
Group for NMR Database (BMRB-PDBj): Toshimichi Fujiwara, Hideo Akutsu, Naohiro Kobayashi, Eiichi Nakatani, Yoko Harano
Other Collaborators: Daron M. Standley (iFREC, Osaka Univ.), Kengo Kinoshita (Tohoku Univ.), Hiroyuki Toh (MIB, Kyushu Univ.), Hiroshi Wako (Waseda Univ.), Nobutoshi Ito (Tokyo Med. Dent. Univ. )
Secretary: Chisa Kamada
[Chart: processed data numbers by year, 1972 to 2009; vertical axis 0 to 6000]
We process 25-30% of the data deposited worldwide, mainly from the Asian and Oceanian regions.
Total: 64,229 entries as of 24 March 2010
Processed data numbers at PDBj and wwPDB
How is data quality maintained?
• The quality of each structure is strictly examined when deposited. Both the depositor and our primary annotators examine it. Without their approval, the PDB ID cannot be authorized.
• Experimental information (Structure Factors / Distance Restraints) must be deposited together with the atom coordinates.
Get Entry Data from our XML-based browser
Access to http://www.pdbj.org/
Enter a PDB ID (e.g. 12as) in the box and press GO
12as
Summary for each PDBID
Graphic viewer: jV http://www.pdbj.org/jV/
Amino acid sequence (FASTA)
Maintain Format Standards
• PDB
• PDB Exchange (mmCIF)
  – Mechanism for extension based on new demands
• PDBML
  – Derived from mmCIF
  – All entries converted to XML
  – Automatic translation from mmCIF data files and dictionaries
  – 3 styles of translation released
  – PDBML: the representation of archival macromolecular structure data in XML.
mmCIF (macromolecular Crystallographic Information File)
An mmCIF file is a list of data items, each consisting of a name and value pair.
_name value

_entry.id 1GOF
_cell.length_a 98.000
_cell.length_b 89.400
_cell.length_c 86.700
_cell.angle_alpha 90.00
_cell.angle_beta 117.80
_cell.angle_gamma 90.00
_symmetry.space_group_name_H-M 'C 2 '

loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_seq_id
_atom_site.label_asym_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.id
1 ATOM N N  ALA 1 A 38.840  0.236 1.012 1.00 34.65 1
1 ATOM C CA ALA 1 A 38.356 -0.999 0.357 1.00 42.26 2
1 ATOM C C  ALA 1 A 37.098 -1.547 1.056 1.00 41.25 3
1 ATOM O O  ALA 1 A 36.619 -0.946 2.028 1.00 29.44 4
1 ATOM C CB ALA 1 A 39.398 -2.114 0.379 1.00 40.70 5
2 ATOM N N  SER 2 A 36.610 -2.666 0.495 1.00 32.67 6
2 ATOM C CA SER 2 A 35.411 -3.244 1.202 1.00 34.90 7
2 ATOM C C  SER 2 A 35.683 -4.740 1.081 1.00 38.30 8
2 ATOM O O  SER 2 A 36.827 -5.147 0.747 1.00 28.59 9
2 ATOM C CB SER 2 A 34.063 -2.660 0.823 1.00 24.49 10
2 ATOM O OG SER 2 A 33.031 -3.308 1.686 1.00 20.37 11
Example of mmCIF description
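The name/value layout above can be read with very little code. The following is a minimal sketch in Python, assuming a simplified subset of mmCIF (single-line items and `loop_` tables only; no multi-line text fields, no `data_` blocks, and no atom names containing quote characters). The function name `parse_simple_mmcif` is hypothetical and is not part of any PDBj tool.

```python
import shlex

def parse_simple_mmcif(text):
    """Parse single-line `_name value` items and `loop_` tables (simplified subset)."""
    items = {}                # standalone name -> value
    tables = []               # one list of row dicts per loop_
    names, rows = None, []    # state of the loop_ currently being read

    def close_loop():
        nonlocal names, rows
        if names:
            tables.append([dict(zip(names, row)) for row in rows])
        names, rows = None, []

    for line in text.splitlines():
        tokens = shlex.split(line)      # keeps quoted values such as 'C 2 ' together
        if not tokens:
            continue
        if tokens[0] == "loop_":
            close_loop()
            names = []
        elif names is not None and tokens[0].startswith("_") and not rows:
            names.append(tokens[0])     # column name of the current loop
        elif names is not None and not tokens[0].startswith("_"):
            rows.append(tokens)         # one data row of the current loop
        elif tokens[0].startswith("_") and len(tokens) == 2:
            close_loop()
            items[tokens[0]] = tokens[1]
    close_loop()
    return items, tables

# A few lines taken from the example on the slide.
sample = """
_entry.id 1GOF
_cell.length_a 98.000
_symmetry.space_group_name_H-M 'C 2 '
loop_
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
N ALA 38.840
CA ALA 38.356
"""
items, tables = parse_simple_mmcif(sample)
```

Here `items` holds the standalone pairs and `tables[0]` holds one dictionary per atom row. A production reader should use a full mmCIF parser instead of this sketch.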
General Advantages of XML description
• Well-defined data
• Structured data description
• Common tools to parse, write and validate data written in XML
• Easy extension of data items
• Active communication among different databases
• Development of native XML database
PDBML: the canonical XML for PDB
To design PDBML, the following points were considered:
1) Use the macromolecular Crystallographic Information File (mmCIF) as the template: _name value → <tag> content </tag>
2) For compatibility, use the names and structure of mmCIF as much as possible, because mmCIF has a comprehensive dictionary.
XML schema mapping
(Westbrook, Ito, Nakamura, Henrick, Berman (2005) Bioinformatics, 21, 988-992)
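The "_name value → <tag> content </tag>" idea can be illustrated with a short Python sketch. It is a simplified illustration only: the element layout (a `<category>Category` wrapper holding one row element with one child per item) loosely follows the atom_siteCategory example in this talk, while the `datablock` root, the absence of the PDBx namespace, and the function name `items_to_xml` are assumptions, not the actual PDBML schema.

```python
import xml.etree.ElementTree as ET

def items_to_xml(items):
    """Map {'_category.item': value} pairs onto nested XML elements."""
    root = ET.Element("datablock")          # hypothetical root element
    row_by_category = {}
    for name, value in items.items():
        category, item = name.lstrip("_").split(".", 1)
        if category not in row_by_category:
            wrapper = ET.SubElement(root, category + "Category")
            row_by_category[category] = ET.SubElement(wrapper, category)
        ET.SubElement(row_by_category[category], item).text = value
    return root

xml_root = items_to_xml({
    "_cell.length_a": "98.000",
    "_cell.length_b": "89.400",
    "_entry.id": "1GOF",
})
xml_text = ET.tostring(xml_root, encoding="unicode")
```

Because the mmCIF names are reused as tag names, the mmCIF dictionary can keep serving as the single definition of every item, which is the compatibility argument made in point 2.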
PDBML: canonical XML description of PDB data, developed by the wwPDB
(Westbrook et al. (2005) Bioinformatics, 21, 988-992) http://pdbml.pdb.org/schema/pdbx.xsd, http://pdbml.pdb.org/schema/pdbx-ext.xsd, ftp://ftp.pdbj.org/
→ No validation errors in more than 64,000 PDB file descriptions.
<PDBx:atom_siteCategory> <PDBx:atom_site id="1">
</PDBx:atom_site>
<atom_record id="1">ATOM 1 A A 1 1 ? . THR THR N N N 17.047 14.099 3.625 1.00 13.79</atom_record>
Full-tag description
Separated file for coordinates
ATOM 1 N THR A 1 17.047 14.099 3.625 1.00 13.79 PDB-format
Example of PDBML for an atom coordinate.
Data Format update (v3.2)
1) New “SPLIT” record
Structure for Supramolecules
3) Ligand dictionary: CCD (Chemical Component Dictionary, http://www.wwpdb.org/ccd.html)
Development of other Databases and Services
• PDBj Mine
• Encyclopedia of Protein Structures, eProtS (Kinjyo, Kudo & Ito)
• Molecule of the Month, MoM (Goodsell & Kudo)
• Alignment of Sequence and Structures, MAFFTash (Kato, Toh & Standley)
• Homolog protein search, Sequence Navigator (Standley)
• Similar fold search, Structure Navigator (Standley & Toh)
• Function Annotation from Folds and Sequences, SeSAW (Standley)
• Search for Similar Surface, eF-seek (Kinoshita & Nakamura)
• Electron Microscopy Navigator, EM-Navi (Suzuki)
• Ligand Binding Site Search, GIRAF (Kinjo)
• Protein Dynamics Database, ProMode (Wako & Endo)
• Protein Folds Browser, Protein Globe (Kinjo & Standley)
• Protein Molecular Surface Database, eF-site (Kinoshita & Nakamura)
Template: 1zsy A
Spanner model
New! Spanner Hybrid-Template Modeling http://www.pdbj.org/spanner/ Developed by Mieszko Lis (MIT), Daron M. Standley (iFREC, Osaka U), Haruki Nakamura (IPR, Osaka U)
SFAS (http://www.pdbj.org/sfas/): Get homologs in PDB and obtain alignments with 3D models
Input your e-mail address
Side-chain modeling: Combinatorial problem
Dead End Elimination (DEE)
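The slide names DEE without giving details. As an illustration only, here is a minimal Python sketch of the original dead-end elimination criterion: rotamer r at position i is discarded when its best-case energy is still worse than the worst-case energy of some other rotamer t at the same position. The energy tables, the function name `dee_prune`, and the toy numbers are hypothetical, and this is not necessarily how the PDBj modeling service implements the step.

```python
def dee_prune(self_e, pair_e):
    """Iterate the original DEE criterion until no rotamer can be removed.

    self_e[i][r]       : self energy of rotamer r at position i
    pair_e[(i, j)][r][s]: pair energy of rotamer r at i and rotamer s at j (i < j)
    Returns the list of surviving rotamer indices for each position.
    """
    n = len(self_e)

    def pe(i, r, j, s):
        return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

    alive = [list(range(len(e))) for e in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                # Best case for r: lowest possible interaction with every other position.
                best_r = self_e[i][r] + sum(
                    min(pe(i, r, j, s) for s in alive[j]) for j in range(n) if j != i)
                for t in alive[i]:
                    if t == r:
                        continue
                    # Worst case for t: highest possible interaction with every other position.
                    worst_t = self_e[i][t] + sum(
                        max(pe(i, t, j, s) for s in alive[j]) for j in range(n) if j != i)
                    if best_r > worst_t:
                        alive[i].remove(r)   # r can never be part of the global minimum
                        changed = True
                        break
    return alive

# Toy example: two positions with two rotamers each (arbitrary energy units).
toy_self = [[0.0, 10.0], [0.0, 0.0]]
toy_pair = {(0, 1): [[0.0, 1.0], [0.0, 1.0]]}
survivors = dee_prune(toy_self, toy_pair)
```

Pruning shrinks the combinatorial search space before any exhaustive or heuristic search over the remaining rotamers.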
Multiple Alignment of Sequence and Structures (by Kato, Toh & Standley)
Procedure of Homology Modeling @PDBj
Homology modeling service (by Lis, Standley, Nakamura)
Sequence: represented by a series of characters
Discrete information: ATGC-DNA PRTEIN
A query search to find similar sequences is well suited to a digital computer. (e.g.) Are there any sequences which are similar to my sequence?
Structure: represented by floating point numbers
Analog information
Query should be analog: (e.g.) Are there any structures which are similar to my structure?
Query Structure
Structural similarity is defined by the Number of Equivalent Residues:
$\mathrm{NER}_{d_0} = \sum_i e^{-(d_i/d_0)^2}$
Standley, D. M. et al. (2004) PROTEINS, 57, 381–391. Standley, D. M. et al. (2005) BMC Bioinfo. 6, 221. Standley, D. M. et al. (2008) Brief. Bioinfo. 9, 276-285.
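A short Python sketch of how an NER-style score can be computed from aligned Cα-Cα distances. It assumes a Gaussian weighting exp(-(d_i/d0)^2) per aligned pair, with d0 = 4 Å taken from the SeSAW slide later in this talk; the exact functional form should be checked against the cited papers, and the function name and example distances are hypothetical.

```python
import math

def number_of_equivalent_residues(distances, d0=4.0):
    """Sum of Gaussian-weighted equivalences over aligned Calpha pairs (assumed form)."""
    return sum(math.exp(-(d / d0) ** 2) for d in distances)

# Hypothetical Calpha-Calpha distances (in angstroms) for five aligned residue pairs.
example_distances = [0.0, 1.0, 2.0, 4.0, 8.0]
ner = number_of_equivalent_residues(example_distances)
```

With this weighting a perfectly superposed pair contributes 1 and a pair 8 Å apart contributes almost nothing, so the score behaves like a soft count of well-matched residues.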
Structure Navigator Search of Similar Folds
PDBID: 2dc4 165aa long Hypothetical Protein by Bagautdinov & Kunishima (RIKEN)
Sequence identity: 27% with 2fjt (Adenylyl Cyclase)
Sequence alignment
2dc4 -------MEIEVKFRVNFEDIKRKIEGL--GA-K-FFGI-EEQEDVYFE-----L-PSPK
2fjt SEHFVGKYEVELKFRVM--DL-TTLHEQLVAQKATAFTLNNHEKDIYLDANGQDLADQQI

2dc4 LLRVRKINNTGKSYITYKE-ILDKRNEEFYELEFEVQDPEGAIELFKRLGFKVQGVVKKR
2fjt SMVLREMNPSGIRLWIVKGPG-A----ER-E-ASNIEDVSKVQSMLATLGYHPAFTIEKQ

2dc4 RWIYKLNNVTFELNRVEKAGDFLDIEVIT-S--NPEEGKKIIWDVARRLGLKEEDVEPKL
2fjt RSIYFVGKFHITVDHLTGLGDFAEIAIMTDDATELDKLKAECRDFANTFGLQVDQQEPRS

2dc4 YIELIN-
2fjt YRQLLGF
Structural alignment
NER (Number of Equivalent Residues): 124 (75 %)
RMSD (root-mean-square deviation of Cα): 2.0 Å
Blue: 2dc4 Red: 2fjt
Pfam mapping of 2dc4
1. Structural alignment
2. Pfam alignment of 2dc4
3. Pfam alignment of 2fjt (Adenylyl Cyclase)
4. Combine the alignments for 2dc4 & 2fjt
Active site of 2fjt (Adenylyl Cyclase)
Active sites of 2dc4 & 2fjt (adenylyl cyclase)
Glu
Glu
2dc4 is predicted to have a phosphate-hydrolyzing function.
SeSAW: Sequence-derived Structure Alignment Weights for identifying functional sites
• A way of comparing sequence and structure similarities between proteins
• Structural similarities measured using ASH structural alignment program
• Sequence similarities measured using position-specific scoring matrices (PSSMs) from PSI-BLAST
(by Standley, D. M.) Standley, D. M. et al., PROTEINS (2008) 72, 1333-1351.
Protein Functional Annotation
[Equation: S_Target, a sum over aligned residue pairs m combining a score for structural similarity, a Blosum62 score, and a score from PSSM values]
d_m: distance between the Cα atom pairs in the aligned structures. d_0: 4 Å
Identification of Protein Families/Superfamilies
Superposed structures
1x42A 1v96A
1y4yA 1th5A
1yz6A 1j27A
1orbA 1wv9A
1me5B 2cwqB
Standley, D. M. et al., PROTEINS (2008) 72, 1333-1351.
(There are 37 other structures in PROTEIN 3000 with S_Target values above 50.)
Superposed structures
2acjD 1wj5A
1ii5A 2czlA
1nm3B 1wjkA
1j1iA 2dstB
2fa4B 1v9wA
1udxA 1wxqA
Standley, D. M. et al., PROTEINS (2008) 72, 1333-1351.
Nature (2009) 458, 1185-190
Structural superpositions allow you to see which conserved residues are close in space
SeSAW: Protein Functional Annotation
Similar local functional site search
GIRAF: Geometric Indexing with Refined Alignment Finder (by Kinjo, A. R.)
Ca2+ binding sites: 2CCL (c.39: EF-hand), 1TXV (b.69: β-propeller), 1EDH (b.1: β-sandwich)
Kinjo, A. R., Nakamura, H. (2007) Biophysics 3, 75-84. Kinjo, A. R., Nakamura, H. (2009) Structure, 17, 234-246.
LBSML: Ligand Binding Site ML
Kinjo, A. R., Nakamura, H. (2007) Biophysics 3, 75-84.
Outline of the GI-IR algorithm
For a given query with its own Refset-query:
1. Search for Refsets-template in RDB that have similar shapes and environments.
2. Count the number of overlapping atoms within 2 Å radius: cnt(i_tref, i_qref)
3. Compute the GI score:
4. Refine the alignments of highly scoring hits.
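Step 2 above can be sketched in Python as follows. This is a minimal illustration that assumes both atom sets have already been expressed in a common local coordinate frame and that only geometry matters; whether GIRAF also requires matching atom types at this step is not stated on the slide. The function name and coordinates are hypothetical.

```python
import math

def count_overlapping_atoms(template_atoms, query_atoms, radius=2.0):
    """Count query atoms that have at least one template atom within `radius` angstroms."""
    count = 0
    for q in query_atoms:
        if any(math.dist(q, t) <= radius for t in template_atoms):
            count += 1
    return count

# Hypothetical coordinates (angstroms) in a shared local frame.
template = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
query = [(0.5, 0.0, 0.0), (5.0, 1.9, 0.0), (10.0, 10.0, 10.0)]
overlap = count_overlapping_atoms(template, query)
```

A database implementation would replace the double loop with an indexed range query, which is the point of storing the Refsets in an RDB.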
Basis set of a local coordinates system: Refset, by Delaunay tetrahedron
Each LBS is composed of hundreds of Refsets, which are stored in RDB. (atom geometry, atom type, environment)
186,485 LBS
Refset-query
GIRAF method
All-against-all search of Ligand-binding sites
• All PDB entries as of 2008/6/13: 51,289 entries.
• All ligands except for water and large ligands with > 1000 atoms.
• A ligand binding site is a set of protein atoms closer than 5 Å from the ligand: 186,485 sites (8,161,398 refsets) in total.
• All-against-all comparison: ∼60 hours using 160 cores of Xeon (3.2 GHz).
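The binding-site definition used above (protein atoms closer than 5 Å to the ligand) can be sketched in Python. The atom dictionaries with `name` and `xyz` keys and the function name are hypothetical; real input would come from a parsed PDB or mmCIF entry.

```python
import math

def binding_site_atoms(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return the protein atoms lying closer than `cutoff` angstroms to any ligand atom."""
    return [
        p for p in protein_atoms
        if any(math.dist(p["xyz"], l["xyz"]) < cutoff for l in ligand_atoms)
    ]

# Hypothetical atoms: one ligand atom at the origin and three protein atoms.
ligand = [{"name": "CA", "xyz": (0.0, 0.0, 0.0)}]
protein = [
    {"name": "OD1", "xyz": (2.4, 0.0, 0.0)},
    {"name": "OE2", "xyz": (0.0, 4.9, 0.0)},
    {"name": "CB", "xyz": (6.0, 0.0, 0.0)},
]
site = binding_site_atoms(protein, ligand)
```

Applying this selection to every ligand of every entry yields the set of sites that are then compared all-against-all.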
Sequence identity and structural similarity of Ligand Binding Sites are weakly correlated.
Structural motifs (total: 2959 motifs; 69,748 Ligand Binding Sites) defined with P = 10^-15 and at least 10 sites
TIM barrel, P-loop, Rossmann fold
1II8:c.37: Rad50 ABC-ATPase 1S7N:d.108: Coenzyme-A (CoA) binding site of acetyl transferase
ATP(Phosphate) /CoA
Similarity network for 3000 motifs: two nodes (structural motifs) are connected if a member of one node is similar to a member of the other node.
Cross-fold similarity of ligand binding motif
5EST: b.47: porcine pancreatic elastase (β-barrel) 1PEK: c.41: proteinase (Rossmann-like fold)
XAI/PAPF
Similarity network
Ca2+/Ca2+
2CCL: c.39: cellulosomal scaffolding protein A 1TXV: b.69: human integrin alpha-IIb
Similarity network
Summary
• Nearly 3000 common ligand binding motifs (with at least 10 members) were identified.
• Most motifs are confined within homologous (super) families.
• The similarity network seems biologically meaningful.
• From our comprehensive structural classification, many similar motifs (4035 pairs of motifs) do not share a common fold.
GIRAF service is available at PDBj. Kinjo, A. R., Nakamura, H. (2009) Structure 17, 234-246.
eF-site/eF-surf/eF-seek
(Kinoshita, K. & Nakamura, H.)
Kinoshita, K. et al., Nucl. Acids Res. (2007) 35, W398-W402. Kinoshita, K. & Nakamura, H., Protein Sci (2005) 14, 711-718. Kinoshita, K. & Nakamura, H., Bioinformatics (2004) 20, 1329-1330.
eF-site database: http://ef-site.hgc.jp
Almost all PDB entries are calculated. Individual subunits are calculated. Each model of an NMR structure is calculated.
Kinoshita, K. & Nakamura, H., Bioinformatics (2004) 20, 1329-1330.
Structure Page for surface and structure browsing
Kinoshita, K. & Nakamura, H., Bioinformatics (2004) 20, 1329-1330.
Molecular surface and electrostatic potential Connolly surface (Molecular surface) by MSROLL.
Probe sphere Solvent Accessible Surface
Protein core Re-entrant surface
Ionic strength: 0.1 M
Search of the similar molecular surfaces of hypothetical proteins
E09(SAH) FAD (NADP)-(ADP-ribose)
coverage = (# of corresponding vertices) / (# of vertices in each patch)
Kinoshita, K. & Nakamura, H. (2005) Protein Science, 14. 711-718.
Similar molecular surfaces are found by a graph-theoretic method (largest clique search) and by a geometric hashing algorithm.
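The clique-search idea can be sketched in Python as follows. The sketch pairs every vertex of patch A with every vertex of patch B, links two pairs when the corresponding intra-patch distances agree within a tolerance, and then finds the largest clique with a plain Bron-Kerbosch search. It uses geometry only; the vertex attributes, tolerance, and function names are assumptions, since the slides do not specify how eF-seek builds its graph.

```python
import math
from itertools import combinations

def correspondence_graph(pts_a, pts_b, tol=0.5):
    """Nodes are vertex pairs (i, j); edges join pairs whose distances are compatible."""
    nodes = [(i, j) for i in range(len(pts_a)) for j in range(len(pts_b))]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l:
            diff = abs(math.dist(pts_a[i], pts_a[k]) - math.dist(pts_b[j], pts_b[l]))
            if diff <= tol:
                adj[(i, j)].add((k, l))
                adj[(k, l)].add((i, j))
    return adj

def largest_clique(adj):
    """Basic Bron-Kerbosch enumeration, keeping the largest maximal clique found."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        for v in list(candidates):
            expand(clique + [v], candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(adj), set())
    return best

# Hypothetical patches: B is A translated by 10 angstroms plus one unrelated vertex.
patch_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
patch_b = [(10.0, 0.0, 0.0), (13.0, 0.0, 0.0), (10.0, 4.0, 0.0), (50.0, 50.0, 50.0)]
matches = sorted(largest_clique(correspondence_graph(patch_a, patch_b)))
```

The size of the clique relative to the patch size gives a quantity analogous to the coverage defined above. Exhaustive clique search is exponential in the worst case, so it is only practical for small patches.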
coverage vs. Z-score plot.
Prediction of Ligand Binding Sites: eF-seek http://ef-site.hgc.jp/eF-seek
Prediction of functional sites by similarity search against eF-site
Search for representative ligand binding sites
For an uploaded PDB-formatted file, putative functional sites are predicted, and the assumed complex structures are returned.
PDBID: 1uan TT1542 Thermus thermophilus HB8
3D structure was determined by N. Handa, S. Kuramitsu, S. Yokoyama (RIKEN) Handa et al. (2003) Protein Science 12, 1621-1632.
• 227 residues
• More than 100 homologous proteins are known, although their 3D structures were not known.
• Rossmann-like fold
• 31 putative functional surfaces converge into 13 functional sites.
• Among them, the 10th functional site is the most promising.
  – The highest similarity of the molecular surfaces.
  – The most well-conserved region
From reverse angle
Similar molecular surfaces are searched for 22,747 functional molecular surfaces in eF-site.
Conserved region
4 protein molecular surfaces are similar.
1. a.43.1.2: 1mjq, 1mjl, 1mjq with SAM
2. d.92.1.11: 1b3d, 1d5j, 1d7x, 966c with MM3, SPC, RS2, S27
3. b.47.1.2: 1f0r with 815
4. b.71.1.1: 1jdd with GLC
Ligand binding mode at the 10th functional site.
MM3
GLC
SAM
S27
Structural alignment
Sequence identity: 27 %
NER (Number of Equivalent Residues): 159 (70 %)
RMSD (root-mean-square deviation of Cα): 1.9 Å
Blue: TT1542 (1uan) Red: 1q7t
Standley & Nakamura (2008) PNE, 53, 638-644.
Green: β-Octylglucoside
(Blue: positive, Red: negative potential)
TT1542 is predicted to have a function similar to that of MshB (hydrolysis of sugars).
Structural Alignment (Pfam mapping)
E. Kanamori [1,2], D.M. Standley [3], S. Liang [3], Y. Murakami [4], A.R. Kinjo [5], Y. Tsuchiya [6], K. Kinoshita [6], H. Nakamura [5]
[1] Biomedicinal Information Research Center, Japan Biological Informatics Consortium
[2] Hitachi Software Engineering Co., Ltd.
[3] Systems Immunology Laboratory, Immunology Frontier Research Center, Osaka University
[4] National Institute of Biomedical Innovation
[5] Institute for Protein Research, Osaka University
[6] Institute of Medical Science, University of Tokyo
surFit: A Docking server for protein molecular surfaces
Search of the Geometrical Complementarity with the weight of ET score.
Maximize F by the optimization procedure.
RID of RalGDS
Surface Docking Procedure
[Equation: F is a sum over all vertex pairs]
Query Sequences
surFit Pipeline
Scoring and re-computation of binding-site propensities
Clustering and Refinement by energy minimization
Submission
Binding interface prediction
Inhibitor (bound)
Two docking modes (CA and CB) [crystal structure (PDB code: 3E8L)]
Chain-A
Chain-B
Chain-C
Target T40_P45

Model  surFit rank  f_nat  L_rmsd  I_rmsd_bb  Quality
for CB
M01    1            0.95   0.84    0.44       high
M02    1            0.76   1.79    0.64       high
M03    1            0.88   0.67    0.47       high
M04    1            0.45   5.34    2.06       acceptable
M05    1            0.76   1.23    0.80       high
for CA
M06    41           0.63   5.95    1.31       medium
M07    41           0.88   2.82    1.26       medium
M08    41           0.83   4.55    1.00       medium
M09    41           0.89   1.00    0.45       high
M10    41           0.86   2.31    0.71       high
surFit models
Homologous complex info.
The models ranked 1st and 41st by surFit were selected for the CB and CA complexes, respectively.
The 7th best structure among all 368 structures
M03 high quality model
M09 high quality model
Get Entry Data from our XML-based browser
Access to http://www.pdbj.org/
Enter a PDB ID (e.g. 12as) in the box and press GO
12as
Information for each Entry: Structural Details
Name of the molecule(s)
Experimental method
Gene ontology information
Result of BLAST search
Sequence Navigator is used.
PDBID list of homologs
Conventional PDB format
Conventional PDB header
Information for each Entry: Link
RCSB-PDB, MSD-EBI
Author names & Journal
List of hit PDBIDs is displayed.
Search of Similar Sequences: Sequence Navigator
Enter a PDB ID or an amino-acid sequence and click "Find All Homologs"
Search of Similar Structures: Structure Navigator
Enter a PDB ID and click "Start Structure Navigator"
List of hit PDBIDs is displayed.
eProtS: Encyclopedia of Protein Structures
PDBj staff
