Toward a Common Data and Command Representation for Quantum Chemistry


Philip Couch

e-Science



Outline of e-CCP1 Project

• Investigate the technological requirements for enabling effective use of Grid resources by the quantum chemistry community
  – Middleware (Globus, Unicore, EGEE)
  – Compute resources
  – Client tools (CoG kits)
  – Track and develop, where necessary, the emerging standards in computational chemistry data and command representation (XML-based CML, CMLComp, FSAtom).

• Realise these requirements by developing some core tools that can be deployed and customised by CCP1 code developers.

• Develop GUI interfaces that will operate with a range of CCP1 codes and implement Grid functionality.


Motivation

• Motivation:
  – The emergence of Grid technologies has provided a generalised framework for the interoperability of computational codes.
  – A common data and command representation:
    • Promotes appropriate data re-use
    • Makes data available to a wider community
  – There are many existing ways to represent data, so why not just convert between them (e.g. Open Babel)?
    • Error prone
    • If there are n formats, n(n-1) converters are required. A solution is to find a common ‘middle ground’ for data, which needs only 2n converters (one to and one from the common form for each format).
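The converter counts can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it assumes one converter per direction between any pair of formats.

```python
def direct_converters(n: int) -> int:
    """Converters needed when every format converts directly to every other."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """Converters needed when every format converts to and from one common form."""
    return 2 * n


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(n, direct_converters(n), hub_converters(n))
```

The two counts are equal at n = 3; for any larger number of formats the common representation needs fewer converters, and the gap grows quadratically.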


Data Types

• What data could we represent?
  – Data/parameters
    • Structures
    • Scalar properties
    • Molecular orbitals
    • Normal modes of vibration
    • Dynamics
    • Basis sets
    • Force fields
    • Pseudo-potentials
  – Control
    • Energy convergence criteria
    • SCF steps
    • Mixing parameters
    • Mesh properties…

• Some data can be shared amongst codes, others will be code specific; semantics is important

• Some of the data will be meta-data (e.g. code used, version, method…)

• Some of the data will define relationships between other data.


Data Representation

• What are the existing ways of representing data?
  – Formats like CIF
  – Relational databases
  – XML (e.g. CML)
  – Objects, methods, data members (intermediate step)

• But, how do we implement the data models (how do we define our vocabulary)?
  – SQL
  – XML schema
  – Class interfaces (e.g. W3C IDL based DOM recommendations)
  – UML


Semantics and Ontology

Semantics

• Providing the meaning of vocabulary is important.
  – We want to ensure appropriate re-use of data.

• Semantics can be controlled by:
  – Annotating the data model (e.g. in XML schema <xsd:annotation>)
  – Links to external sources (e.g. XML dictionaries in CML)

Ontology

• An ontology can be thought of as ‘an explicit specification of concepts and the relationships between them.’

• Relationships between concepts can be expressed using the Resource Description Framework (RDF). RDF is the basis of ontology languages such as OWL and DAML+OIL.

• RDF Schema specifies the relationships used in the RDF, and the relationships between those relationships…

• An ontology helps to reduce implicit assumptions about data and their relationships.


XML Representation

• XML is a widely adopted and mature method of representing structured information

• A vast and increasing range of tools makes XML easily readable and interpretable by applications authored by different groups

• At the expense of conciseness:
  – XML is self describing: it carries meta-data
  – XML can be explicit about data

• Some methods of representing data in XML already exist (e.g. CML), for which there are many tools


An Example

[Figure: a geometry representation for the CH molecule]

[Figure: a basis set representation for the CH molecule]


Relationships

• How do we link the basis sets and geometries?
  – Could rely on implicit linking (<atom elementType="C"…> with <basisSet id="C1"…>)
    • But what happens if we want to change the rules?
  – Could use attributes (<atom id="a1" basis="C1"…>)
    • But documents could come from different sources, and don’t know about each other’s attributes
    • Continual revision of the data model
  – Could describe the relationship using RDF, or in an RDF-like manner

• RDF/n3:

    @prefix r1: <file://chGeom.xml#xpointer> .
    @prefix r2: <file://chBasis.xml#xpointer> .
    @prefix r3: <file://eccpRelations.html#> .

    <r2:(//basisSet[@id="C1"])> <r3:isBasisFor> <r1:(//atom[@elementType="C"])> .
    <r2:(//basisSet[@id="H1"])> <r3:isBasisFor> <r1:(//atom[@elementType="H"])> .
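To illustrate how an application might consume such statements, the sketch below resolves each isBasisFor statement against two small documents using Python's standard xml.etree.ElementTree. The XML content standing in for chGeom.xml and chBasis.xml is hypothetical (only the element and attribute names come from the slide), the statements are held as plain tuples rather than parsed from n3, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for chGeom.xml and chBasis.xml.
GEOM_XML = """<molecule id="CH">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="H"/>
</molecule>"""

BASIS_XML = """<basisSets>
  <basisSet id="C1"/>
  <basisSet id="H1"/>
</basisSets>"""

# One tuple per statement: (XPath into the basis document, predicate,
# XPath into the geometry document).
RELATIONS = [
    (".//basisSet[@id='C1']", "isBasisFor", ".//atom[@elementType='C']"),
    (".//basisSet[@id='H1']", "isBasisFor", ".//atom[@elementType='H']"),
]


def resolve_basis(geom_xml, basis_xml, relations):
    """Return a mapping from atom id to the id of the basis set linked to it."""
    geom = ET.fromstring(geom_xml)
    basis = ET.fromstring(basis_xml)
    assignment = {}
    for subject_path, predicate, object_path in relations:
        if predicate != "isBasisFor":
            continue  # ignore relationships this sketch does not understand
        basis_set = basis.find(subject_path)
        if basis_set is None:
            continue  # the statement points at a basis set that is not present
        for atom in geom.findall(object_path):
            assignment[atom.get("id")] = basis_set.get("id")
    return assignment


if __name__ == "__main__":
    print(resolve_basis(GEOM_XML, BASIS_XML, RELATIONS))
```

Because the link lives outside both documents, changing the rule (for example, assigning basis sets by atom id instead of element type) only means editing the statements, not the geometry or basis set data models.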


Relationships

• But…
  • Passing text is not straightforward; RDF/n3 can be serialised to RDF/XML

• RDF/XML (converted using CWM)

    <rdf:RDF xmlns:r1="file://chGeom.xml#xpointer"
             xmlns:r2="file://chBasis.xml#xpointer"
             xmlns:r3="file://chRelations.html#"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

      <rdf:Description rdf:about="r2:(//basisSet[@id=&#34;C1&#34;])">
        <isBasisFor xmlns="r3:"
                    rdf:resource="r1:(//atom[@elementType=&#34;C&#34;])"/>
      </rdf:Description>

      <rdf:Description rdf:about="r2:(//basisSet[@id=&#34;H1&#34;])">
        <isBasisFor xmlns="r3:"
                    rdf:resource="r1:(//atom[@elementType=&#34;H&#34;])"/>
      </rdf:Description>
    </rdf:RDF>


Other Design Considerations

• How implicit/explicit should we be?

• When should we use ‘general’ or ‘grouping’ tags for data?
  – E.g. to take a CML-like example:

    Exchange energy = -1.025771783
    or
    <eexchange>-1.025771783</eexchange>
    or
    <scalar dictRef="eccp:eexchange">-1.025771783</scalar>

• To what extent should we tag data?
  – E.g.

    <basisExponents>0.0 0.0 0.0 0.0</basisExponents>
    or
    <basisExponents>
      <n>0.0</n> <n>0.0</n> <n>0.0</n> <n>0.0</n>
    </basisExponents>
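The tagging trade-off shows up directly in the reading code. The following sketch, using Python's standard xml.etree.ElementTree, accepts either form of the basisExponents element from the slide; the function name is illustrative and not part of any existing API.

```python
import xml.etree.ElementTree as ET

# The two alternatives from the slide.
COMPACT = "<basisExponents>0.0 0.0 0.0 0.0</basisExponents>"
TAGGED = """<basisExponents>
  <n>0.0</n> <n>0.0</n> <n>0.0</n> <n>0.0</n>
</basisExponents>"""


def read_exponents(xml_text):
    """Return the exponents as floats from either the compact or the tagged form."""
    element = ET.fromstring(xml_text)
    children = element.findall("n")
    if children:
        # Fully tagged form: one child element per number.
        return [float(child.text) for child in children]
    # Compact form: the application must split the text content itself.
    return [float(token) for token in (element.text or "").split()]


if __name__ == "__main__":
    print(read_exponents(COMPACT))
    print(read_exponents(TAGGED))
```

In the compact form the whitespace-splitting rule is an implicit convention every reader has to know; in the tagged form the structure is explicit, at the cost of a more verbose document.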


Using XML

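[Diagram: data flow between XML documents (doc1.xml, doc2.xml, doc3.xml, meta-data), a parser, in.txt, the application, out.txt, a second parser and out.xml, with SAX/DOM and native/foreign libraries as the direct route into and out of the application]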

Reading:

Convert to standard application input (e.g. use XSLT)

Read in the XML directly by using existing/writing your own native/foreign code

Writing:

Parse the standard application output and convert to XML

Write the XML directly by using existing/writing your own native/foreign code
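As a sketch of the "parse the standard application output and convert to XML" route, the Python below turns `label = value` lines into scalar elements with a dictRef attribute, following the CML-like scalar form used elsewhere in this talk. The output line format, the `results` wrapper element, and the label-to-dictionary mapping are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from labels in a code's text output to dictionary entries.
LABEL_TO_DICTREF = {"Exchange energy": "eccp:eexchange"}

# Matches lines of the form "<label> = <number>".
LINE_PATTERN = re.compile(
    r"^\s*(?P<label>[^=]+?)\s*=\s*"
    r"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def output_to_xml(text):
    """Convert recognised 'label = value' lines to <scalar> elements."""
    root = ET.Element("results")
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # not a 'label = value' line
        dict_ref = LABEL_TO_DICTREF.get(match.group("label"))
        if dict_ref is None:
            continue  # label has no dictionary entry in this sketch
        scalar = ET.SubElement(root, "scalar", dictRef=dict_ref)
        scalar.text = match.group("value")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "SCF converged\nExchange energy = -1.025771783\n"
    print(output_to_xml(sample))
```

The pattern is tied to one particular output layout, so it would need revisiting whenever the application's text output changes between versions.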


Using XML - Comments

• Comments:
  – Careful choice of DOM or SAX parser implementation
    • DOM: potentially large overheads when used with large model instances
    • SAX: difficult to code when data is heavily cross referenced
  – Until recently, XML support for FORTRAN has been poor. No existing native parsers and it’s difficult to write your own
    • Solutions
      – native FORTRAN XML modules (Alberto Garcia)
      – FORTRAN DOM (Jon Wakelin)
  – XML libraries such as libXML and Xerces could be used with appropriate wrappers
  – There are mixed SAX and DOM API implementations, e.g. libXML xmlTextReader
  – Parsing standard output is a good option for proprietary code, but suffers from versioning
  – Writing formatted data directly is error prone
    • FORTRAN WXML module (Alberto Garcia)
    • FORTRAN CML writer (Jon Wakelin)
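The DOM/SAX trade-off can be illustrated with Python's standard library, which provides both styles. This is a sketch on a made-up two-atom document; the function and class names are illustrative.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical CML-like geometry used only for this comparison.
GEOMETRY = """<molecule id="CH">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="H"/>
</molecule>"""


def elements_via_dom(xml_text):
    """DOM: build the whole tree in memory, then query it freely."""
    document = minidom.parseString(xml_text)
    return [atom.getAttribute("elementType")
            for atom in document.getElementsByTagName("atom")]


class AtomHandler(xml.sax.ContentHandler):
    """SAX: receive one event per start tag and keep only what is needed."""

    def __init__(self):
        super().__init__()
        self.element_types = []

    def startElement(self, name, attrs):
        if name == "atom":
            self.element_types.append(attrs.get("elementType"))


def elements_via_sax(xml_text):
    handler = AtomHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.element_types


if __name__ == "__main__":
    print(elements_via_dom(GEOMETRY))
    print(elements_via_sax(GEOMETRY))
```

The DOM version holds the full document in memory, which is what becomes expensive for large instances. The SAX version keeps only the state the handler stores, which is why cross references between distant elements have to be tracked by hand.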


Automation

• Data models evolve with time. It is hard work to maintain code by hand. Ideally…

• CML: Java and C++ API generators

• CCPN: Python API generators

• Still have to worry about mapping the wrapper data structures to the internal data structures of the application.

[Diagram: schema, API generator, objects, API wrapper and application, with XML input and a validation step; one link is marked "?"]


Data Modelling

• The focus is back to the data model.

• XSD is not easy to interpret, impeding a collaborative approach to data model design

• Designing is complicated by implementation decisions – it is a good idea to separate the conceptualisation and implementation

• Represent the data model in the Unified Modelling Language (UML)?

• This is a graphical notation (mainly) for expressing designs.

• Can UML express XSD implementation decisions?
  – Yes, through UML stereotypes (subtypes of meta-model types)
  – A UML profile (collection of stereotypes) for schema design has been developed by David Carlson


UML data model

• UML equivalent to the XSD geometry and basis set data model

• UML can be represented as XMI to facilitate the communication of data models between applications.

• Hypermodel will convert XMI to XSD.


Binary Data

• Some scientific data would be best stored in binary (e.g. molecular orbitals)

• Binary data could simply be pointed to by XML

• But…
  – Sharing binary data requires a machine independent way of storing it.

• Could use:
  – HDF
  – NetCDF
  – BinX/DFDL
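To make "machine independent" concrete, the sketch below fixes the byte order and number widths explicitly using Python's standard struct module. It is not a substitute for HDF, NetCDF or BinX/DFDL; the layout (a 4-byte count followed by 8-byte doubles, little-endian) and the function names are assumptions chosen for illustration.

```python
import struct


def pack_coefficients(values):
    """Pack a count and the values with a fixed, platform-independent layout.

    "<" selects little-endian byte order with standard sizes and no padding;
    "I" is a 4-byte unsigned integer and "d" an 8-byte IEEE 754 double.
    """
    return struct.pack(f"<I{len(values)}d", len(values), *values)


def unpack_coefficients(blob):
    """Read back the values written by pack_coefficients on any machine."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


if __name__ == "__main__":
    blob = pack_coefficients([0.5, -1.25, 2.0])
    print(len(blob), unpack_coefficients(blob))
```

Without the explicit "<", struct would use the native byte order and alignment of the machine writing the file, which is exactly the portability problem the slide refers to.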


Current Status

• Drafting CML-like markup and schema for some computational chemistry data
  – Basis sets
  – Molecular orbitals
  – Cartesian and internal coordinates
  – Molecular vibrations
  – Job parameters
  – Scalar quantities

• Set up an eCCP1 Wiki for discussions (grids.ac.uk/eccp)

• Set up NeSCForge project page for code/data model development


Current Status

• Developing a C API for parsing CML geometries
  – Linked to libXML2
  – Designed to scale well with XML file size (uses xmlTextReader)
  – Designed to be easily FORTRAN callable
  – Transparently reads gzipped XML files

• Python module written to read in CML1/2 molecular atom information for the CCP1 GUI

• GROWL (Grid Resources on Workstation Language), a C API for utilising current CLRC Grid portal services, is being developed.
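The streaming approach described for the C API (reading geometries incrementally, including from gzipped files) can be sketched in Python with the standard gzip and xml.etree.ElementTree modules. This is an analogue for illustration only, not the project's C API: the function names and sample document are assumptions, and the atom attributes follow the CML convention of elementType, x3, y3 and z3.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical CML-style geometry; the coordinates are made up.
SAMPLE = """<molecule id="CH">
  <atomArray>
    <atom id="a1" elementType="C" x3="0.0" y3="0.0" z3="0.0"/>
    <atom id="a2" elementType="H" x3="0.0" y3="0.0" z3="1.12"/>
  </atomArray>
</molecule>"""


def iter_atoms(stream):
    """Yield (element type, (x, y, z)) one atom at a time from a file-like object."""
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == "atom":
            position = tuple(float(element.get(axis)) for axis in ("x3", "y3", "z3"))
            yield element.get("elementType"), position
            element.clear()  # drop the atom's contents once it has been consumed


def read_gzipped_atoms(blob):
    """Decompress on the fly and collect all atoms from gzipped XML bytes."""
    with gzip.open(io.BytesIO(blob), "rb") as stream:
        return list(iter_atoms(stream))


if __name__ == "__main__":
    compressed = gzip.compress(SAMPLE.encode("utf-8"))
    print(read_gzipped_atoms(compressed))
```

Atoms are handled as their end tags arrive rather than after the whole document has been parsed, which is the property that lets this style scale with file size.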


Meeting Logistics

Agenda

Monday 5th April

Time            Format                                Location
10.00 – 11.10   Presentations                         Lecture theatre
11.10 – 11.25   Refreshments
11.25 – 12.35   Presentations                         Lecture theatre
12.35 – 13.35   Lunch
13.35 – 14.10   Presentations                         Lecture theatre
14.10 – 15.40   Practical session and announcements   Lecture theatre, training lab, Cramond
15.40 – 16.00   Refreshments
16.00 – 17.30   Practical session                     Lecture theatre, training lab, Cramond
19.00           Conference dinner

Tuesday 6th April

09.00 – 10.10   Presentations                         Lecture theatre
10.10 – 10.25   Refreshments
10.25 – 12.10   Presentations                         Lecture theatre
12.10 – 13.10   Lunch
13.10 – 14.25   Open discussions                      Cramond
14.25 – 14.40   Refreshments
14.40 – 16.00   Open discussions                      Cramond
16.00           Meeting close


Publication of meeting material

1. Option one
   – Contributed material to be published independently (e.g. NeSC technical report on CML)
   – Meeting proceedings (summary) + presentations on web
   – Create a meeting CD containing presentations and proceedings

2. Option two
   – Contributed material to be published independently
   – Meeting proceedings (technical report written by authors of contributed material), focus on existing material and decisions on way forward.
   – Create a meeting CD, presentations, proceedings, and code?


Discussion Topics

• How to construct a working group
  – Who could be involved?

• Process of data model refinement

• Reference Implementation
  – Platforms?
  – What are the requirements?
  – Man power?

• How focused should we be: data types to include, platforms… etc.

• How do we reach a consensus?