Memops Data modelling and automatic code generation Edinburgh 9 September 2008

Memops

Data modelling and automatic code generation

Edinburgh 9 September 2008

Memops - main points

■ Code generation framework

■ Data access subroutine libraries

■ Fully automatic code generation from model

■ Several programming languages in parallel

■ Precise, detailed, validated data

Memops

● Introduction
● Code generation
● Generated libraries
● Applications of Memops

The CCPN Project

■ Collaborative Computing Project for NMR

■ Since 1999

■ Unifying platform for NMR software, similar to CCP4 for X-ray crystallography

■ Community-based, open-source software development

■ Code generation, data model, applications, meetings

NMR Structural Biology Pipeline

[Figure: pipeline diagram. Stage labels: Sample Preparation, NMR Machine, Data Processing, Spectrum Analysis, Structure Calculation, Repository Database. Annotation: "Slow, complex, interactive".]

Native Anarchy

[Figure: several programs for Task1, Task2 and Task3, connected to one another through many separate "Convert" steps.]

With Data Standard

[Figure: the programs for Task1, Task2 and Task3, each connected through a "Convert" step to a central Data Standard.]

Data standard - objectives

● Lossless data transfer between programs - different approaches and architectures

● All data needed for pipeline software
  ■ Creating data, not analysing end results
  ■ Intermediate results needed
  ■ Comprehensive, detailed, complex

● Completeness, integrity of changing data

● Precisely defined standard
  ■ A single central description
  ■ Validation directly against standard

CCPN approach

■ Standard API, no stable format
  ● Easier to maintain as model changes

■ Abstract data model
  ● Exact correspondence to APIs

■ API implementations for several languages

■ Transparent access to XML or DB storage

■ Complete validation of model rules and constraints

Memops

● Introduction
● Code generation
● Generated libraries
● Applications of Memops

Automatic Code generation

■ Model will change over time
  ● Several parallel implementations
  ● Synchronisation between APIs and model
  ● Maintenance and debugging
  ● Resources are limited

■ Automatic Code Generation
  ● Write and debug once and for all
  ● Any domain, from Astrophysics to Zoology
  ● Quick and simple to extend model
    ■ E.g. application-specific packages

Code Generation Framework

[Figure: code generation framework. Labels: Domain Experts, Software Developers, User; MEMOPS framework; UML Model (Package 1, Package 2, Package 3); Autogeneration; APIs (Python, Java, C); Storage (SQL, XML); Documentation; Application; Deposition; Wrappers; Handcoded (< 1%).]

Code Generation

[Figure: code generation data flow. Labels: ObjectDomain (edit UML), UML data; Export; MetaModel as in-memory model (Python objects) and as on-disk model (XML file); Autogeneration; API code, schemas, mappings, etc. Legend: CCPN code, off-the-shelf, files, CCPN generated.]

API generator

[Figure: class diagram of the API generator. Classes: ModelTraverse, TextWriter, ApiGen, PyLanguage, FileApiGen, PyApiGen, PyType, PyFileApiGen.]

• Written in Python
• Modular
• Different generators share code
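As a rough illustration of model-driven generation, the sketch below walks a tiny in-memory model and writes Python class source through a text-writer helper. The model layout (a dict mapping class names to attribute lists) and the names `TextWriter` and `generate_api` as used here are assumptions made for this example; they are not the Memops classes of the same name.

```python
class TextWriter:
    """Collects generated source lines with indentation (hypothetical helper)."""

    def __init__(self):
        self.lines = []
        self.level = 0

    def write(self, text=""):
        # Empty text gives a blank line; otherwise indent to the current level.
        self.lines.append("    " * self.level + text if text else "")

    def result(self):
        return "\n".join(self.lines) + "\n"


def generate_api(model):
    """Traverse the model and emit one class with get/set methods per attribute."""
    out = TextWriter()
    for class_name, attributes in model.items():
        out.write(f"class {class_name}:")
        out.level += 1
        out.write("def __init__(self):")
        out.level += 1
        if not attributes:
            out.write("pass")
        for name, type_name in attributes:
            out.write(f"self._{name} = None  # {type_name}")
        out.level -= 1
        for name, type_name in attributes:
            cap = name[0].upper() + name[1:]
            out.write()
            out.write(f"def get{cap}(self):")
            out.level += 1
            out.write(f"return self._{name}")
            out.level -= 1
            out.write()
            out.write(f"def set{cap}(self, value):")
            out.level += 1
            out.write(f"self._{name} = value")
            out.level -= 1
        out.level -= 1
        out.write()
    return out.result()


# Toy model: one class with two attributes (illustrative only).
MODEL = {"Atom": [("name", "Word"), ("elementSymbol", "Word")]}

source = generate_api(MODEL)
namespace = {}
exec(source, namespace)      # compile the generated code
atom = namespace["Atom"]()
atom.setName("CA")
print(atom.getName())
```

Because all output goes through one traversal and one writer, a generator for another target language would only need to replace the text emitted at each step.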

Memops

● Introduction
● Code generation
● Generated libraries
● Applications of Memops

Model features

■ Packages to subdivide model, code, and data files

■ Objects. Unique context, compare-by-identity

■ Complex data types. Different contexts, compare-by-value

■ Simple data types, PositiveInt, enumerations, …

■ Attributes and links:
  ● Cardinality, frozen/modifiable, derived
  ● Unique/ordered collections (sets, lists, unique lists)

■ Ad-hoc constraints on attributes, simple and complex datatypes, and objects.
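The distinction between objects, complex data types and constrained simple types can be sketched in plain Python. This is only an illustration under stated assumptions: the `Spectrum` class, its `numDim` attribute and the `positive_int` helper are hypothetical, and the rule "integer greater than zero" is assumed for a PositiveInt-style type.

```python
from dataclasses import dataclass


def positive_int(value):
    """Validate a PositiveInt-style simple data type (assumed rule: int > 0)."""
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"not a positive integer: {value!r}")
    return value


@dataclass(frozen=True)
class Url:
    """Complex data type: immutable and compared by value."""
    host: str
    path: str


class Spectrum:
    """Object (hypothetical): compared by identity, Python's default behaviour."""

    def __init__(self, numDim):
        self._numDim = positive_int(numDim)

    @property
    def numDim(self):
        # Frozen attribute: readable, but no setter is provided.
        return self._numDim


same_value = Url("example.org", "/data") == Url("example.org", "/data")
same_object = Spectrum(2) == Spectrum(2)
print(same_value, same_object)
```

Two `Url` values with equal fields are interchangeable, whereas two `Spectrum` objects with equal attributes remain distinct entities.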

Molstructure model package

[UML class diagram; association multiplicities (1, *) are not recoverable from the extraction.]

StructureEnsemble
  +ensembleId: Int, +atomNamingSystem: Line, +resNamingSystem: Line
  +getEnsembleValidations()

Model
  +serial: Int, +name: Line, +details: Text

Chain
  +code: Line
  +getChain()

Residue
  +seqId: Int, +seqCode: Int, +seqInsertCode: Line =
  +getResidue()

Atom
  +name: Word, +elementSymbol: Word
  +getAtom(), +getElementSymbol(), +getChemAtom()

Coord
  +altLocationCode: Line = , +x: Float, +y: Float, +z: Float, +bFactor: Float = 0.0, +occupancy: Float = 1.0

Linked classes from other packages:
  ccp.molecule.MolSystem.MolSystem (+code: Word, +name: Text, +keywords: Line, ...)
  ccp.molecule.MolSystem.Chain
  ccp.molecule.MolSystem.Residue
  ccp.molecule.MolSystem.Atom
  ccp.molecule.ChemComp.ChemAtom

Association role shown: +coordChains

CCPN APIs

■ Application Programming Interface
  ● Object oriented
  ● Data accessed in memory as if stored in the data model

■ Implementations come with:
  ● Integrated, transparent I/O (file or database)
  ● Complete validity checking
  ● Protection against casual change (data encapsulation)
  ● Versioning and backwards compatibility
  ● Event notifier system
  ● Slot for application-specific data
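How validity checking, encapsulation and notifiers can combine in an accessor-based API is sketched below. The `Chain` class, the `registerNotify` method and the validity rule for `code` are assumptions chosen for illustration; they do not reproduce the generated CCPN API.

```python
class Chain:
    """Toy API class (hypothetical) with a validated, encapsulated 'code' attribute."""

    _notifiers = []              # callbacks shared by all Chain objects

    def __init__(self, code):
        self._code = None
        self.setCode(code)

    @classmethod
    def registerNotify(cls, callback):
        cls._notifiers.append(callback)

    def getCode(self):
        return self._code

    def setCode(self, value):
        # Validity check happens before any change is made.
        if not isinstance(value, str) or not value or "\n" in value:
            raise ValueError(f"invalid chain code: {value!r}")
        self._code = value
        # Notify listeners only after a successful change.
        for callback in self._notifiers:
            callback(self, "code")


events = []
Chain.registerNotify(lambda obj, attr: events.append((attr, obj.getCode())))

chain = Chain("A")
chain.setCode("B")
try:
    chain.setCode("")
except ValueError:
    pass                         # rejected: the object keeps its previous valid state
print(chain.getCode(), events)
```

Routing every change through a setter gives one place to validate and one place to notify, which is what makes such behaviour practical to generate automatically.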

Python+XML at runtime

[Figure: runtime architecture. Components:
  User application: science code, user interface, utility functions
  Python API: data get, set; validity check
  XML I/O code: generic XML read/write
  XML I/O mappings: what to do for which element
  XML parser
  Data storage, XML files: user data in CCPN XML format
Legend: CCPN code, off-the-shelf, application code, files, CCPN generated.]

Java+DB at runtime

[Figure: runtime architecture. Components:
  Presentation layer: science code, user interface, utility functions
  Java API
  Hibernate and Hibernate mappings
  Database and database schema
  Optional: custom queries in HQL (Hibernate Query Language)
Legend: CCPN code, off-the-shelf, application code, files, CCPN generated.]

Now Available

■ Version 2.0 just released

■ Python+XML, Java+XML, C+XML, Java+DB (with Hibernate)

■ Available under GPL license from Sourceforge or www.ccpn.ac.uk

■ CCPN Data Standard:
  ● NMR, Macromolecules, LIMS
  ● 46 packages
  ● 552 classes and data types
  ● Python+XML implementation: 800,000+ lines of code

Memops

● Introduction
● Code generation
● Generated libraries
● Applications of Memops

CcpNmr Suite

■ Analysis
  ● Interactive NMR analysis

■ FormatConverter
  ● Convert between 30+ NMR and structure formats

■ Built on top of CCPN model (Python+XML)

■ Version 2.0 released

■ Widely used in macromolecular NMR

CcpNmr Analysis

ExtendNMR NMR pipeline

■ Integrated macromolecular NMR pipeline - from sample to structure

■ Pre-existing programs from 8 groups

■ In-memory conversion to internal data structures

■ Integrated versions released:
  ● ARIA (NMR structure generation)
  ● Bruker TOPSPIN, the manufacturer's processing/analysis package

BIOXDM

■ Software pipeline for on-synchrotron crystallography
  ● Exploit new technology ( goniometers)
  ● Experiment optimisation, acquisition, and on-line processing

■ Independent data model, with Memops machinery

■ Java+DB implementation for runtime concurrent access

EUROCarbDB

■ Distributed deposition database
  ● Glycobiology and glycomics
  ● NMR, MS, HPLC and topology

■ Java. Database storage using Hibernate

■ CCPN model Java+DB implementation slots in as-is

Funding acknowledgements

■ BBSRC CCPN grants

■ European Union grants
  ● EXTEND-NMR, EU-NMR, NMR-Life, NMRQUAL, and TEMBLOR contracts

■ Industry support
  ● AstraZeneca, Dupont Pharma (now BMS), Genentech, GlaxoSmithKline

● Peter Keller (BIOXDM) thanks Synchrotron ‘Soleil’, the Global Phasing Consortium and EU FP6 ‘BIOXHIT’

People

■ Authors: Prof. Ernest Laue, Wayne Boucher, Rasmus Fogh, Tim Stevens, John Ionides, Wim Vranken (EBI), Peter Keller (Global Phasing)

■ Collaborators at U. Cambridge: Dan O’Donovan, Wolfgang Rieping, Alan da Silva, Darima Lamazhapova

■ Collaborators at EBI (MSD), Hinxton: Kim Henrick, Anne Pajon, Chris Penkett

■ Special thanks to: Bruker Biospin GmbH (TOPSPIN), Michael Nilges (ARIA), Bas Leeflang (EUROCarbDB; FP6 contract RIDS-CT-2004-01195

END

Overview

● Packages
● The Implementation package
  ■ Objects
  ■ DataTypes and DataObjTypes
● Access control

ARIA – structure generation from NMR data

[Figure: Application, ARIA Data Model and ARIA XML on one side; CCPN Data Model and CCPN XML on the other; linked by a custom conversion.]

■ ARIA imports
  ● Peak Lists
  ● Constraints
  ● Sequences
  ● Chemical shifts

■ ARIA exports
  ● Peak Assignments
  ● Filtered Constraints
  ● Violations
  ● Structures

API functions

■ ‘get’ and ‘set’ (Attributes and links)

■ ‘add’ and ‘remove’ (Collection attributes and links)

■ ‘sorted’ (Unordered collection links)

■ ‘findFirst’ and ‘findAll’ (Collection links)
  ● Simple filtering (attribute == value)

■ ‘create’ and ‘new’ (Objects)
  ● Normal and ‘factory function’ object creation

■ ‘delete’ (Objects)
  ● ‘Delete’ function – cascades to objects rendered invalid by deletion

■ ‘checkValid’, ‘checkAllValid’ (Objects)

■ API classes are strongly coupled. For efficiency reasons object-to-object links are two-way.
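The function families listed above can be illustrated with a toy parent-child pair. This is a minimal sketch, not the generated CCPN API: the `Chain` and `Residue` classes here, their attributes and the method names built on them (`newResidue`, `sortedResidues`, `findAllResidues`, `findFirstResidue`) are assumptions that only mirror the naming pattern on the slide.

```python
class Residue:
    def __init__(self, chain, seqCode, name):
        self.chain, self.seqCode, self.name = chain, seqCode, name
        self.isDeleted = False
        chain._residues.append(self)      # two-way link: child registers with parent

    def delete(self):
        self.isDeleted = True
        self.chain._residues.remove(self)


class Chain:
    def __init__(self, code):
        self.code = code
        self.isDeleted = False
        self._residues = []

    def newResidue(self, seqCode, name):  # factory function on the parent
        return Residue(self, seqCode, name)

    def sortedResidues(self):
        return sorted(self._residues, key=lambda r: r.seqCode)

    def findAllResidues(self, **conditions):
        # Simple filtering: every given attribute must equal the given value.
        return [r for r in self._residues
                if all(getattr(r, k) == v for k, v in conditions.items())]

    def findFirstResidue(self, **conditions):
        found = self.findAllResidues(**conditions)
        return found[0] if found else None

    def delete(self):
        # Cascade: residues cannot exist without their chain.
        for residue in list(self._residues):
            residue.delete()
        self.isDeleted = True


chain = Chain("A")
chain.newResidue(2, "GLY")
ala = chain.newResidue(1, "ALA")
chain.newResidue(3, "GLY")

sorted_codes = [r.seqCode for r in chain.sortedResidues()]
gly_count = len(chain.findAllResidues(name="GLY"))
first_ala = chain.findFirstResidue(name="ALA")
chain.delete()
print(sorted_codes, gly_count, first_ala is ala, ala.isDeleted)
```

Keeping both directions of the link (residue to chain, chain to residues) is what lets the parent cascade its deletion without searching for children.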

FormatConverter - The NMR Translator

[Figure: FormatConverter architecture around the CCPN Data Model.
  Data kinds: peaks, chemical shifts, acquisition parameters, processing parameters
  Format-specific readers: XEasy, NmrView, Bruker, Varian, NMRPipe, Azara, ...
  Data model entry: generic peak converter, generic chemical shift converter, generic acquisition parameters converter
  Format-specific writers: XEasy, NmrView, ... (peaks, chemical shifts)]
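The reader/writer arrangement can be sketched as follows: each format needs one reader into a common representation and one writer out of it, instead of a converter for every pair of formats. The two formats below are made up for the example and are not the XEasy or NmrView formats; the peak dictionary stands in for the data model.

```python
def read_space_format(text):
    """Reader for a made-up format: 'serial ppm1 ppm2' per line."""
    peaks = []
    for line in text.strip().splitlines():
        serial, ppm1, ppm2 = line.split()
        peaks.append({"serial": int(serial), "position": (float(ppm1), float(ppm2))})
    return peaks


def write_csv_format(peaks):
    """Writer for a second made-up format: 'ppm1,ppm2,serial' per line."""
    return "\n".join(
        f"{p['position'][0]:.3f},{p['position'][1]:.3f},{p['serial']}" for p in peaks
    )


# Registries of format-specific code; adding a format adds one entry to each.
READERS = {"space": read_space_format}
WRITERS = {"csv": write_csv_format}


def convert(text, source, target):
    """Every conversion passes through the common peak representation."""
    return WRITERS[target](READERS[source](text))


converted = convert("1 8.250 120.100\n2 7.900 118.500", "space", "csv")
print(converted)
```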

ExtendNMR: ARIA

■ Structure generation from macromolecular NMR data, ambiguous distance constraints

■ One of two leading programs

■ Python and scripts, with CNS dynamics engine

■ All input and output integrated to CCPN standard

ARIA: CCPN object selection

ExtendNMR: Bruker TOPSPIN

■ NMR processing program of major NMR instrument company

■ Java. In-memory conversion to CCPN Java+XML implementation

■ CCPN output in current TOPSPIN release, expanded in upcoming release.

Data Model v. Data Format

Abstract model (UML):

  Atom (+elementName: String = C)
  Bond (+bondOrder: Float = 1.0)
  Association: each Bond links 2 +atoms; each Atom has * +bonds

Relational Database:

  Atom (Atom_ID, elementName)
  Atom_Bond_Connect (Bond_ID, Atom_ID)
  Bond (Bond_ID, bondOrder)

XML:

  <Atom ID="AT1" elementName="C">
    <BondList>
      <Bond IDREF="BD1"/>
      .
    </BondList>
  </Atom>

  <Bond ID="BD1" bondOrder="1.0">
    <Atom1 IDREF="AT1"/>
    <Atom2 IDREF="AT2"/>
    .
  </Bond>
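To make the contrast concrete, the sketch below stores one small model instance (two atoms joined by a bond) in both an XML document and a relational database, using only the Python standard library. The element, table and column names follow the slide; the `Molecule` wrapper element and the in-memory dictionaries are assumptions for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Abstract model instance: two atoms joined by one bond.
atoms = {"AT1": "C", "AT2": "C"}
bonds = {"BD1": {"bondOrder": 1.0, "atoms": ("AT1", "AT2")}}

# XML rendering of the model.
root = ET.Element("Molecule")            # wrapper element, an assumption
for atom_id, element in atoms.items():
    ET.SubElement(root, "Atom", ID=atom_id, elementName=element)
for bond_id, bond in bonds.items():
    node = ET.SubElement(root, "Bond", ID=bond_id, bondOrder=str(bond["bondOrder"]))
    ET.SubElement(node, "Atom1", IDREF=bond["atoms"][0])
    ET.SubElement(node, "Atom2", IDREF=bond["atoms"][1])
xml_text = ET.tostring(root, encoding="unicode")

# Relational rendering of the same model: three tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Atom (Atom_ID TEXT PRIMARY KEY, elementName TEXT)")
db.execute("CREATE TABLE Bond (Bond_ID TEXT PRIMARY KEY, bondOrder REAL)")
db.execute("CREATE TABLE Atom_Bond_Connect (Bond_ID TEXT, Atom_ID TEXT)")
db.executemany("INSERT INTO Atom VALUES (?, ?)", list(atoms.items()))
for bond_id, bond in bonds.items():
    db.execute("INSERT INTO Bond VALUES (?, ?)", (bond_id, bond["bondOrder"]))
    db.executemany(
        "INSERT INTO Atom_Bond_Connect VALUES (?, ?)",
        [(bond_id, atom_id) for atom_id in bond["atoms"]],
    )
link_count = db.execute("SELECT COUNT(*) FROM Atom_Bond_Connect").fetchone()[0]

print(xml_text)
print(link_count)
```

The many-to-many link needs a separate connection table in the relational form and ID references in the XML form, while the abstract model states it once as an association.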

Packages

[Figure: package diagram. Packages: ChemElement, ChemComp, Molecule, MolStructure, MolSystem, memops.AccessControl, memops.Implementation.]

Packages

■ Partition model, code, and data
■ Import each other
■ Can be omitted
■ All import Implementation and AccessControl

■ Each has a TopObject
■ No links between data from rival TopObjects (different extents of data)

Root and TopObjects

[UML class diagram; association multiplicities (1, 2, *) are not recoverable from the extraction.]

memops.Implementation.MemopsRoot
  +name: Word = ccpProject, +override: Boolean = False, +currentUserId: Word = user
  +newGuid(), +getPackageLocator()

memops.Implementation.TopObject
  +guid: Line
  +getPackageLocator()

Classes shown from other packages:
  ccp.molecule.Molecule.Molecule
  ccp.molecule.Molecule.MolResidue
  ccp.molecule.ChemComp.ChemComp
  ccp.molecule.ChemComp.AbstractChemAtom
  ccp.molecule.ChemComp.ChemAtom
  ccp.molecule.ChemComp.ChemBond

Association roles shown: +currentMolecule, +currentChemComp, +chemAtoms

TopObjects

■ One in every package
  ● Ultimate parent to all objects in package

■ Have globally unique identifier (‘guid’)
■ currentXyz links from root
■ Links can constrain links between descendants

■ In file implementations:
  ● Hold links to storage and backup locations
  ● Live in Implementation as almost empty shell

Overview

● Packages
● The Implementation package
  ■ Objects
  ■ DataTypes and DataObjTypes
● Access control

CcpNmr Analysis

■ NMR Assignment Program
  ● Inspired by ANSIG and Sparky
  ● Demonstrates CCPN approach
  ● Modern interface and scripting
  ● Scalable and extensible

■ Operating Systems
  ● Linux, Sun, SGI, OSX, Windows

■ Languages
  ● Python
    ■ Data model interaction
    ■ Tk Graphical interface
    ■ Scripting
  ● C
    ■ OpenGL/Tk contours
    ■ Structure display
    ■ Mathematical operations

Implementation Package

■ Model and Code:
  ● Supertypes that define all objects
    ■ Objects
    ■ DataTypes
    ■ DataObjTypes
  ● Basic data types

■ Data – how to access the real data:
  ● Data location pointers
  ● Current-package pointers
  ● Implementation data are not part of the data set, and are not in the database.
  ● Represent view or session?

Data Location

[UML class diagram; association multiplicities (1, *, 1..*) are not recoverable from the extraction.]

MemopsRoot
  +name: Word = ccpProject, +override: Boolean = False, +currentUserId: Word = user
  +newGuid(), +getPackageLocator()

TopObject
  +guid: Line
  +getPackageLocator()

FileStorageObject
  +isLoaded: Boolean, +isModified: Boolean, +isReading: Boolean, +isModifiable: Boolean = True, +createdBy: Word, +lastUnlockedBy: Word
  +setIsModifiable(), +touch(), +saveTo(repository), +removeFrom(repository), +save(), +backup()

Repository
  +name: Line, +format: StorageFormat = xml, +url: Url
  +getFileLocation(packageName)

PackageLocator
  +targetName: Word = any

Association roles shown: +repositories, +activeRepositories, +backedUp, +backup, +stored, +repositories; two of the links are marked {ordered}

Objects and their Supertypes

[UML class diagram; inheritance arrows and association multiplicities are not recoverable from the extraction.]

ComplexDataType «DataType»
  +className: Word, +packageName: Word, +packageShortName: Word, +qualifiedName: Line, +inConstructor: Boolean
  +getQualifiedName()

MemopsObject
  +isDeleted: Boolean
  +getExpandedKey()

ImplementationObject

DataObject
  +applicationData: ApplicationData

MemopsRoot
  +name: Word = ccpProject, +override: Boolean = False, +currentUserId: Word = user
  +newGuid(), +getPackageLocator()

TopObject
  +guid: Line
  +getPackageLocator()

FileStorageObject
  +isLoaded: Boolean, +isModified: Boolean, +isReading: Boolean, +isModifiable: Boolean = True, +createdBy: Word, +lastUnlockedBy: Word
  +setIsModifiable(), +touch(), +saveTo(repository), +removeFrom(repository), +save(), +backup()

FileMemopsRoot
  +saveModified(), +saveAll(), +refreshTopObjects(packageName), +backupAll(), +importData(filePath)

FileTopObject
  +loadFrom(repository), +load(), +restore()

DbMemopsRoot

DbTopObject

Example classes shown: ccp.molecule.Molecule.Molecule, ccp.molecule.Molecule.MolResidue

Association roles shown: +topObject, +root, +currentMolecule

Simple Data Types

[Figure: one box per simple DataType.]

Boolean, Int, Long, Float, Double, String, Line, Text, Word, LongWord, SingleLine, Token, SpacelessString, PositiveInt, NonNegativeInt, PositiveFloat, NonNegativeFloat, FloatRatio, PositiveDouble, NonNegativeDouble, DateTime, Dict, StringKeyDict, UrlProtocol, Any

Complex Data Types

[UML class diagram; inheritance arrows are not recoverable from the extraction.]

ComplexDataType «DataType»
  +className: Word, +packageName: Word, +packageShortName: Word, +qualifiedName: Line, +inConstructor: Boolean
  +getQualifiedName()

MemopsDataTypeObject «DataType»
  +override: Boolean
  +endOverride()

Url «DataType»
  +protocol: UrlProtocol = file, +user: Line, +password: Line, +host: Line, +path: PathString, +port: Int, +dataLocation: PathString
  +getDataLocation()

ApplicationData «DataType»
  +application: Line, +keyword: Line

AppDataBoolean «DataType»: +value: Boolean
AppDataDouble «DataType»: +value: Double
AppDataFloat «DataType»: +value: Float
AppDataInt «DataType»: +value: Int
AppDataLong «DataType»: +value: Long
AppDataString «DataType»: +value: String