Upload
mary-buchanan
View
232
Download
10
Tags:
Embed Size (px)
Citation preview
http://www.pdbj.org/
Haruki Nakamura
Institute for Protein Research,
Lab. Protein Informatics, Osaka University
PRAGMA 11 Workshop, Osaka Univ., on October 16, 2006
Integrated Web Service for Database Queries and Computations
Grid Web Services for Analog Queries to Protein 3D Structures
BioGrid at Osaka http://www.biogrid.jp
Started from 2002Goals:Grid Technology
Development for: Biotechnology (drug discovery) and
Medical Sciences.Leader: Shinji Shimojo
(CMC, Osaka Univ.)Government Support (M
EXT): 5yearsH. NakamuraK. YamaguchiT. Takada
H. Matsuda
T. Akiyama
S. ShimojoS. Date
Integration of Data Grid and Computing Grid
to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost.
Less Conceputal(Formal)
Highly Conceptual
Fine
Scale
Conceptual Level
Large
Literature
3D-3D-CordinatesCordinates
Function
Sequence
Molecular InteractionNetwork
Atom
Molecule
Organelle
Cell
Tissue
Organ
Organism
MassAnalysis
Medical
DrugData
SugarChain
Swiss-Prot
MEDLINE
KEGG
OMIM
Gene Ontology
PDB
InterPro, Pfam
MDDR
DIPGeneExpression
ChemicalStructure
PubChem, Ligand, ChEBI
DDBJ/EMBL/GenBank
H-Inv, FANTOM
GEO
Databases in Life Sciencesnearly 1,000 DBs
BioDataGrid by Hideo Matsuda at Osaka Univ.
BioGrid at Osaka http://www.biogrid.jp
Started from 2002Goals:Grid Technology
Development for: Biotechnology (drug discovery) and
Medical Sciences.Leader: Shinji Shimojo
(CMC, Osaka Univ.)Government Support (M
EXT): 5yearsH. NakamuraK. YamaguchiT. Takada
H. Matsuda
T. Akiyama
S. ShimojoS. Date
Integration of Data Grid and Computing Grid
to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost.
BioPfuga: Biosimulation Platform United on Grid Architecture A platform where individual application progra
ms at the different levels are united to execute a hybrid computation. In particular, BioPfuga is a platform for biosimulation.
http://www.biogrid.jp/
Nakamura et al. (2004) New Generation Computing, 22, 157-166.
Construction of BioPfuga
・ Propose and Design Communication between program components by XML description (BMSML: BioMolecular Simulation-ML) , and Creation of the APIs to handle it.
・ Divide programs into a set of many components
Different program modules are then implemented as the service programs based on the OGSA (Open Grid Service Architecture) mechanism.
http://www.biogrid.jp/
Construction of BioPfuga
・ Propose and Design Communication between program components by XML description (BMSML: BioMolecular Simulation-ML) , and Creation of the APIs to handle it.
・ Divide programs into a set of many components
Different program modules are then implemented as the service programs based on the OGSA (Open Grid Service Architecture) mechanism.
Quantum mechanics (QM) and Molecular mechanics (MM) simulation coupled
on BioPfuga
Coupled simulation on Grids
QM area
Active site ofan enzyme
Ligand
MM area
Htotal = HQM(x,x)
+ HQM/MM(x,y)
+ HMM(y,y)
Hamiltonian of total system
Calculate in QM area
Calculate in MM area
(Yonezawa et al. at SC2004)
developed by NEC Quantum Chemistry Group
Rapid computation for huge molecular systems (20,000 basis for RHF, 1,000 basis for CASSCF and MP2), with high parallel performance
Simulation of Electronic structure (1) AMOSS Ab initio Molecular Orbital Simulation for Supercomputer
developed by Yoshifumi Fukunishi at JBIRC-AIST and Haruki Nakamura at Institute for Protein Research, Osaka Univ.
Simulation of Protein-solvent structureprestoX-basic Protein Engineering Simulator eXtended –basic versin-
Simulation of Electronic structure (2) GSO-X generalized spin density function theory (DFT)
developed by Shusuke Yamanaka & Kizashi Yamaguchi, at Osaka Univ.
856 atoms, 8672 AO’s
PKC-C1B
A hybrid-QM/MM with BioPfuga on the Grid Service@SC2004
Client controller
BMSML
User@USA, Pittsburgh
@TokyoIT
Site-C
QM(DFT)
GSO-XService
GT3
GSO-X(DFT)
@CMC Osaka
MPI50CPU
Site-B
AMOSSService
GT3
AMOSS(HF)
QM(HF)
MPI30CPU
Viewer portal
Site-A
MM(MD)
prestoXService
GT3
prestoX(MPI)
MD-Grape2
@IPR Osaka
BMSML
BMSML
8CPU
8-boards
(Special purpose computing board: max 50GFLOPS)
Configuration of a computing system on a Grid environment with several different program
modules.
Portal/Client Program
Service QM(HF)
ServiceQM
(DFT)
ServiceMM
User
Communicate using SOAP
Distributed computing System
Personal user
SOAP SOAP
BioGrid at Osaka http://www.biogrid.jp
Started from 2002Goals:Grid Technology
Development for: Biotechnology (drug discovery) and
Medical Sciences.Leader: Shinji Shimojo
(CMC, Osaka Univ.)Government Support (M
EXT): 5yearsH. NakamuraK. YamaguchiT. Takada
H. Matsuda
T. Akiyama
S. ShimojoS. Date
Integration of Data Grid and Computing Grid
to access Integrated and Standardized Databases with High-performance Computers, from anywhere and with low cost.
Our final goal is Integration of
Data Grid and Computing Grid.
Portal/Client Program
Service A
ServiceC
ServiceB
User
GT GT GT
Communicate using SOAP
Distributed Computing System
Personal user
SOAP
site-A site-B site-C
GRID computing services
H. Nakamura, BioGrid2005 (March 9, 2005)
Integration of computing systems and Database query systems on a Grid
Portal/Client Program
Service A
ServiceC
ServiceB
User
GT GT GT
Communicate using SOAP
Distributed Computingand DB System
Personal user
SOAP
site-A site-B site-C
GRID computing services Database query services
H. Nakamura, BioGrid2005 (March 9, 2005)
Integration of computing systems and Database query systems on a Grid
Figure 1. A Schematic for the Proposed Portal: The portal is envisioned to be a modular web services architecture achieved by using an implementation of the Simple Object Access Protocol (SOAP http://www.w3.org/TR/soap/), that allows for seamless data exchange between the portal and all registered contributors. SOAP is a light weight protocol for exchange of information in a decentralized, distributed environment. It is an XML-based protocol, which defines a framework for representing remote procedure calls and responses. The XML wrappers basically map the individual contributing metadata format into an XML format that the data portal understands (Drinkwater et al., 2004). These services will be platform and language independent allowing other services (other portals or clients) to communicate with the data portal (see http://www.e-science.clrc.ac.uk/web/projects/dataportal/)
Proposed Portal for Multiple Databases for Protein StructuresBerman, H. et al. (2006) Structure, 14, 1211-1217.
Theoretical Model DB 1
Theoretical Model DB 2
Theoretical Model DB 3
PDB (Protein Data Bank): Protein Tertiary Structure Database
Atom species and coordinates, amino acid residues, any other experimental results with raw data, experimental methods, and the conditions.
X-ray crystallography, NMR, and Electron microscope experiments.
Protein Tertiary Structure
http://www.wwpdb.org/
Kim Henrick, Helen Berman, Haruki Nakamura
International collaboration in wwPDB
(Berman, Henrick & Nakamura (2003) Nat. Struct. Biol. 10, 980)
RutgersUniv.UCS
DNIST
PDBjEBI
RCSB
1) Curation, data processing, and registration are made by all the members, collaborating with each other.
2) We have a single data archive, which islooked after by one “archive keeper (RCSB)”.
3) Data format and new descriptions will be discussedamong the members.
4) Members are encouraged to develop their ownbrowsers, viewers, and other APIs and services.
Protein Data Bank Japan
http://www.pdbj.org/
At Institute for Protein Research, Osaka Univ. since 2001 assisted from the Institute for Bioinformatics Research and Development, Japan Science and Technology Agency (BIRD-JST).
Processed data numbers at PDBj
We process more than 30 % deposited data of the entire world, mainly from Asian and Oceania regions
0
1000
2000
3000
4000
5000
6000
7000
0
5000
10000
15000
20000
25000
30000
35000
year
Yearly wwPDB registration numberTotal wwPDB registration number
Yearly PDBj registration number
Tot
al r
egis
trat
ion
num
ber
Yea
rly
regi
stra
tion
num
ber
1972 75 80 85 90 95 2000 2005
Maintain Format Standards• PDB (conventional and flat)• PDB Exchange (mmCIF)
– Mechanism for extension based on new demands• PDBML
– Derived from mmCIF– All entries converted to XML– Automatic translation from mmCIF data files an
d dictionaries– 3-styles of translation released
(Westbrook, Ito, Nakamura, Henrick, Berman (2005) Bioinformatics, 21, 988-992)
PDBML: canonical XML description of PDB data, developed by the wwPDB.
(Westbrook et al. (2005) Bioinformatics, 21, 988-992)http://pdbml.pdb.org/schema/pdbx.xsd, http://pdbml.pdb.org/schema/pdbx-ext.xsd, ftp://ftp.pdbj.org/
→ No validation errors for more than 39,000 PDB file description.
<PDBx:atom_siteCategory> <PDBx:atom_site id="1"> <PDBx:group_PDB>ATOM</PDBx:group_PDB> <PDBx:type_symbol>N</PDBx:type_symbol> <PDBx:label_atom_id>N</PDBx:label_atom_id> <PDBx:label_comp_id>THR</PDBx:label_comp_id> <PDBx:label_asym_id>A</PDBx:label_asym_id> <PDBx:label_entity_id>1</PDBx:label_entity_id> <PDBx:label_seq_id>1</PDBx:label_seq_id> <PDBx:Cartn_x>17.047</PDBx:Cartn_x> <PDBx:Cartn_y>14.099</PDBx:Cartn_y> <PDBx:Cartn_z>3.625</PDBx:Cartn_z> <PDBx:occupancy>1.00</PDBx:occupancy> <PDBx:B_iso_or_equiv>13.79</PDBx:B_iso_or_equiv> <PDBx:auth_seq_id>1</PDBx:auth_seq_id> <PDBx:auth_comp_id>THR</PDBx:auth_comp_id> <PDBx:auth_asym_id>A</PDBx:auth_asym_id> <PDBx:auth_atom_id>N</PDBx:auth_atom_id> <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num> </PDBx:atom_site>
<atom_record id="1">ATOM 1 A A 1 1 ? . THR THR N N N 17.047 14.099 3.625 1.00 13.79</atom_record>
Full-tag description
Separated file for coordinates
Examples for the atom coordinates
Applications of PDBML
Browser at PDBj with the native XML DB
Database extension and annotation
SOAP (Simple Object Access Protocol) services at PDBj
Molecular graphics viewer (jV) , which directly parses PDBML, displaying the information written in XML
(Kinoshita & Nakamura (2004) Bioinformatics, 20, 1329-1330, Westbrook et al (2005) Bioinformatics, 21, 988-992)
Wild-card "*" canbe used: *2as, 12*, *
"and" and wild-card are available.
/datablock[struct_confCategory/struct_conf/pdbx_PDB_helix_length>=“10"]/@datablockName
Search proteins, which have one or more helices,which are longer than 10.
xPSSS new facility
Both XQuery and XPath are available.
Search for all entries and helix of length with helix of length greater than or equal to 10 residues:
Queries by XQuery and XPath at our new xPSSS
Offers interactive molecular visualization with RasMol type commands (source code free).
Can be used both as Stand-alone and as applet with Java.
Any polygons defined by XML can be displayed and manipulated.
Can parse PDBML files and
display the molecules.
http://www.pdbj.org/PDBjViewer/
PDBjViewer or jV
xPSSS Database System
PDBML
PDBMLplus
Web server
XSLT processor
downloader
Loader
Archive(RCSB-PDB/MSD-EBI
/PDBj)
Native XML-DB
PDBMLplus
PDBMLplusF
download(FTP)
FTP server
Internet
DDBJSwisProt/UniProt
PIR/GenBank/KEGG/GDB/ProTherm/EzCatDB
EBI/CSA
/CATRE
S
Function/Source
Information
Get/Input Tools
CATRESData
Annotation
Data
AddInformation
Filtering &Recostructing
PDBMLplus
PDBMLplusF
xPSSS
Manual inputfrom literatures
PDF filesfor the primary
citations
Protein Molecular Surface Database, eF-site (Kinoshita & Nakamura )
Protein Dynamics Database, ProMode (Wako & Endo)
Alignment of Structural Homologues, ASH (Toh)
Encyclopedia of Protein Structures, eProtS (Ito & Nakamura )
Sequence Navigator & Structure Navigator (Standley )
Development of Secondary Databases:data mining
Superfolds of Proteins
Identify Protein Function from
・ Sequence similarity ・ Fold similarity
Compare Protein folds/architectures
DNA-binding domain of Bovine Papilloma Virus-1 E2 ( 2BOP )
RNA-binding domain of the U1A spliceosomal protein ( 1URN )
Distance Matrix
Map of the entire protein folds
Hou et al. (2003) Proc. Natl. Acad. Sci. USA, 100, 2386-2390.
498 SCOP domains
Dali (distance matrix alignment)http://www.ebi.ac.uk/dali/
1. Rapidly generates structure neighbors for any PDB ID
2. Both an HTTP and SOAP interface are available
3. Alignments are displayed in a readable format
4. 3D coordinates can be viewed or downloaded
Structure Navigator
http://www.pdbj.org/strucnavi/
N
i
d
d
d
i
eNER
2
0
0
Structural similarity is defined by the Number of Equivalent Residues (NER1)
1Standley D, Toh H, Nakamura H. Proteins 2004;57:381-391Standley, D.M. et al. (2005)
Structure Navigator Real-Time Server
Blue: QueryRed: Temp1
Standley, D.M. et al. (2005)
Real-Time Query Strategy
Query
Clustered Templates(5,097 representatives)
Nearest Cluster Standley, D.M. (2006)
Structure Navigator-RT could reply the query quickly, but ...
When we open the Web service of Structure Navigator-RT and accept many requests, our computer resource may not be enough.
We want to use large computing resource somewhere else.
We need a simple and flexible tool for this purpose. Grid Web Service
Grid Web Service using Opal
SDSC Technical Report TR-2006-5, 2006.
Opal is a toolkit for wrapping application programs as Web services, providing scheduling, Grid security, and data management.
CCGrid 2006
Grid Web Service using Opal
Operation Providersfor WSRF
Operation Providersfor Notification
GetRPProvider
SetRPProvider
QueryRPProvider
SubscribeProvider
NotificationConsumerProvider
Globus Toolkit 4 (GT4)
Application Service (WSRF)
Opal OP
Opal
Opal OPToolkit
Opal Operation Providerby W.W. Li & K. Ichikawa
(under the PRIUS project)
Opal is deployed as one of Operation Providers of Globus Toolkit 4 (GT4): Opal Operation Provider (Opal-OP)
Computing Server (PC-cluster 16 nodes)
Head Node (1 PC node):
Tomcat
Opal Operation Provider Service
Pre-WS GRAM
GT4
OpenPBS (launch pbs_server, pbs_sched, pbs_mon)
Run the Application program
Job scheduling
15 Worker nodes:
openPBS (launch pbs_mon)
Run the Application program
Portal Server
Apache
Tomcat (jsp -Input page/servlet -Opal-OP client program)
Library of GT4
Library of Operation Provider
Library of Operation Provider Service
End userSystem Architecture
@IPR, Osaka Univ.(Suita)
@CMC, Osaka Univ.(Toyonaka)
Submit Job
Request Query
LaunchLaunch
Query: 3D fold (1crn)
Structure Navigator-RT with Opal-OP on GRID
Blue: Query StructureRed: Second nearest cluster to the Query
Query structure
Conclusion
We implemented the Opal Operation Provider to run Structure Navigator-RT as a Grid Web Service, using the local scheduler, openPBS, to run our application program on the PC cluster at the Cyber-Media Center, Osaka University, through the network inside Osaka University.
http://www.pdbj.org/
・ Surface similarity
Identify Protein Function from
・ Sequence similarity ・ Fold similarity
Kinoshita, Furui, Nakamura (2002) J. Struct. Funct. Genomics 2, 9-22.Kinoshita, Nakamura (2004) Bioinformatics 20, 1329-1330.
eF-site database for Ligand recognition sites
www.pdbj.org /eF-site/
Folylpolyglutamate synthetase
query: MJ0226 free formsearch result against eF-site/ActiveSite + eF-site/mono-nucleotide (1684 entries)
Pyruvate kinase
Similarity was found here.
Function identification of hypothetical proteins Kinoshita & Nakamura (2003) Protein Science 12, 1589-1595. Kinoshita & Nakamura (2005) Protein Science, 14. 711-718.
eF-seek: Query service for search of similar molecular surfaces (-version)
ef-site.hgc.jp /eF-seek/
Kinoshita, K. & Nakamura, H. (2006)
Application of Grid to Web Service of PDBj Query for the analog structural data: Protein folds and molecular surfaces
Blue: QueryRed: Temp1
New Structure
eF-site Database
Quick response by using Grid
computing
Looking for the similar folds( Structure Navigator Real-Time serve
r )
Looking for the similar surfaces(eF-seek)
Fold Database
Query for molecular surface
Query for fold
Result
Result
Portal/Client Program
Service A
ServiceC
ServiceB
User
GT GT XMLDB
Communicate using SOAP
Distributed Computingand DB System
Personal user
site-A site-B site-C
GRID computing serviceswith Opal-OP
Database query services with XQuery/XPath
Integration of computing systems and Database query systems at PDBj
Analog Query Search
Text Query Search
・ Reiko Yamashita, Daron M. Standley (BIRD-JST / Inst. Protein Res., Osaka Univ.)
・ Kohei Ichikawa, Susumu Date, Shinji Shimojo (Cyber Media Center, Osaka Univ.)
・ Wilfred W. Li , Peter Arzberger (UCSD)
・ Hiroyuki Toh (Medical Inst. Bioregulation, Kyushu Univ)
・ Kengo Kinoshita (Inst. Med. Science, Univ. Tokyo)
・ Other PDBj members (BIRD-JST / Inst. Protein Res., Osaka Univ.)
・ PRIUS
Acknowledgements