Storage and Data
Grid Middleware 6
David Groep, lecture series 2005-2006
Grid Middleware VI 2
Outline
• Data management concepts: metadata, logical filename, SURL, TURL, object store
• Protocols: GridFTP, SRM, RFT/FTS, FPS & scheduled transfers with GT4 (LIGO)
• End-to-end integrated systems: SRB
• Structured data and databases: OGSA-DAI
• Data curation issues: media migration, content conversion (emulation or translation?)
Grid data management
Data in a grid needs to be
• located
• replicated
• life-time managed
• accessed (sequentially and at random)
and the user does not know where the data is
Types of storage
• ‘File oriented’ storage
  • cannot support content-based queries
  • needs annotation & metadata to be useful (note that a file system and name is a ‘type of meta-data’)
  • most implementations can handle any-sized object (but MSS tape systems cannot handle very small files)
• Databases
  • structured data representation
  • supports content queries well via indexed searches
  • good for small data objects (with BLOBs of MBytes, not GBytes)
Grid storage structure
For file oriented storage
File storage layers (file system analogy)
Separating the storage concepts helps both interoperation and scalability
1. Semantic view: description of data in words and phrases
2. Meta-data view: describe data by attribute-value pairs (the filename is also an A-V pair), as in filesystems like HPFS, EXT2+ and AppleFS with ‘extended attributes’
3. Object view: refers to a blob of data by a meaningless handle (unique ID), e.g. the inode in typical Unix FS’s; in FAT, directory entry + alloc table (mixes filename and object view)
4. Physical view: block devices, i.e. a series of blocks on a disk, or a specific tape & offset
Storage layers (grid naming terminology)
• LFN (Logical File Name) – level 2
  • like the filename in the traditional file system
  • may have hierarchical structure
  • is not directly suitable for access, as it is site independent
• GUID (Globally Unique ID) – level 3
  • opaque handle to reference a specific data object
  • still independent of the site
  • GUID-LFN mapping is 1-n
• SURL (Storage URL, or physical file name, PFN) – level 3
  • SE specific reference to a file
  • understood by the storage management interface
  • GUID-SURL mapping is 1-n
• TURL (Transfer URL) – ‘griddy level 4’
  • current physical location of a file inside a specific SE
  • is transient (i.e. only exists after being returned by the SE management interface)
  • has a specific lifetime
  • SURL-TURL mapping is 1-(small number, typically 1)
terminology from EDG, gLite and Globus
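The chain of names on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Catalogue` class, the `request_turl` helper, the example host names and the `gsiftp` default are all hypothetical and are not the API of EDG, gLite or Globus. A real SE would normally return a TURL pointing at whichever disk server holds the file; the sketch simply reuses the SURL's host and path.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Toy in-memory file + replica catalogue (hypothetical)."""
    lfn_to_guid: dict = field(default_factory=dict)    # n LFNs -> 1 GUID
    guid_to_surls: dict = field(default_factory=dict)  # 1 GUID -> n SURLs

    def register(self, lfn, surl):
        # Reuse the GUID when the LFN is already known, else mint a new one.
        guid = self.lfn_to_guid.get(lfn) or str(uuid.uuid4())
        self.lfn_to_guid[lfn] = guid
        self.guid_to_surls.setdefault(guid, []).append(surl)
        return guid

    def alias(self, new_lfn, guid):
        # A second logical name for the same object (GUID-LFN is 1-n).
        self.lfn_to_guid[new_lfn] = guid

    def replicas(self, lfn):
        # LFN -> GUID -> all SURLs (GUID-SURL is 1-n).
        return list(self.guid_to_surls[self.lfn_to_guid[lfn]])


def request_turl(surl, protocol="gsiftp", lifetime_s=3600, now=None):
    """Stand-in for the SE management interface: SURL -> transient TURL."""
    now = time.time() if now is None else now
    host_and_path = surl.split("://", 1)[1]
    return {"turl": f"{protocol}://{host_and_path}", "expires": now + lifetime_s}
```

A client would resolve an LFN to its replicas, pick one SURL, and only then ask the chosen SE for a TURL, which is valid until `expires`.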
Data Management Services Overview
Storage concepts
using the OSG-EDG-gLite terminology …
• Storage Element
  • management interface
  • transfer interface(s)
• Catalogues
  • File Catalogue (meta-data catalogues)
  • Replica Catalogue (location services & indices)
• Transfer Service
• File Placement
• Data Scheduler
Grid Storage Concepts: Storage Element
• Storage Element
  • responsible for manipulating files, on anything from disk to tape-backed mass storage
  • contains services up to the filename level
  • the filename is typically an opaque handle for files, as a higher-level file catalogue serves the meta-data, and the same physical file will be replicated to several SEs with different local file names
  • SE is a site function (not a VO function)
• Capabilities
  • Storage space for files
  • Storage Management interface (staging, pinning)
  • Space management (reservation)
  • Access (read/write, e.g. via GridFTP, HTTP(s), POSIX(-like))
  • File Transfer Service (controlling influx of data from other SEs)
Storage Element: grid transfer services
Possibilities
• GridFTP
  • de-facto standard protocol
  • supports GSI security
  • features: striping & parallel transfers, third-party transfers (TPTs, like regular FTP) part of protocol
  • issue: firewalls don’t ‘like’ open port ranges needed by FTP (neither active nor passive)
• HTTPs
  • single port, so more firewall-friendly
  • implementation of GSI and delegation required (mod_gridsite)
  • TPTs not part of protocol
• …
GridFTP
• ‘secure, robust, fast, efficient, standards based, widely accepted’ data transfer protocol
• Protocol based: multiple independent implementations can interoperate
• Globus Toolkit supplies reference implementation: Server, Client tools (globus-url-copy), Development Libraries
GridFTP: The Protocol
• FTP protocol is defined by several IETF RFCs
• Start with most commonly used subset
  • Standard FTP: get/put etc., 3rd-party transfer
• Implement standard but often unused features
  • GSS binding, extended directory listing, simple restart
• Extend in various ways, while preserving interoperability with existing servers
  • Striped/parallel data channels, partial file, automatic & manual TCP buffer setting, progress monitoring, extended restart
source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
GridFTP: The Protocol (cont)
• Existing standards
  • RFC 959: File Transfer Protocol
  • RFC 2228: FTP Security Extensions
  • RFC 2389: Feature Negotiation for the File Transfer Protocol
  • Draft: FTP Extensions
• GridFTP: Protocol Extensions to FTP for the Grid
  • Grid Forum Recommendation GFD.20
  • http://www.ggf.org/documents/GWD-R/GFD-R.020.pdf
source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
Striped Server Mode
• Multiple nodes work together *on a single file* and act as a single GridFTP server
• An underlying parallel file system allows all nodes to see the same file system and must deliver good performance (usually the limiting factor in transfer speed)
  • i.e., NFS does not cut it
• Each node then moves (reads or writes) only the pieces of the file that it is responsible for.
• This allows multiple levels of parallelism: CPU, bus, NIC, disk, etc.
  • Critical if you want to achieve better than 1 Gb/s without breaking the bank
source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
GridFTP Striped Transfer
• MODE E, SPAS (Listen): returns list of host:port pairs, then STOR <FileName>
• MODE E, SPOR (Connect): connect to the host-port pairs, then RETR <FileName>
[Figure, dated 18-Nov-03: blocks 1-16 of one file striped over Hosts A-D (Host A: blocks 1, 5, 9, 13; Host B: 2, 6, 10, 14; Host C: 3, 7, 11, 15; Host D: 4, 8, 12, 16). Hosts X, Y and Z form the other end of the transfer; each block they hold is labelled with the host it goes to, e.g. Block 1 -> Host A, Block 2 -> Host B]
source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
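The block layout in the striped-transfer figure follows a round-robin pattern, which can be sketched as below. The function names are hypothetical, and the sketch only models which host owns which block; it does not speak GridFTP.

```python
def stripe_owner(block_number, hosts):
    """Return the host responsible for a 1-based block number (round-robin)."""
    return hosts[(block_number - 1) % len(hosts)]


def stripe_layout(n_blocks, hosts):
    """Map every host to the list of blocks it moves."""
    layout = {host: [] for host in hosts}
    for block in range(1, n_blocks + 1):
        layout[stripe_owner(block, hosts)].append(block)
    return layout


layout = stripe_layout(16, ["Host A", "Host B", "Host C", "Host D"])
```

With 16 blocks and four hosts this reproduces the assignment in the figure: Host A moves blocks 1, 5, 9 and 13, Host D moves 4, 8, 12 and 16.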
Disk to Disk Striping Performance
[Plot “BANDWIDTH Vs STRIPING”: bandwidth (Mbps, axis 0 to 20000) against degree of striping (axis 0 to 70), with one curve each for # Stream = 1, 2, 4, 8, 16 and 32]
source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
GridFTP: Caveats
• Protocol requires that the sending side do the TCP connect (possible firewall issues)
  • Working on V2 of the protocol
  • Add explicit negotiation of streams to relax the directionality requirement above(*)
  • Optionally adds block checksums and resends
  • Add a unique command ID to allow pipelining of commands
• Client / Server
  • Currently, no server library, therefore peer-to-peer type apps are VERY difficult
  • Generally needs a pre-installed server
  • Looking at a “dynamically installable” server
source: Bill Allcock, ANL, Overview of GT4 Data Services, 2004
(*)DG: like a kind of application-level BEEP protocol
SE transfers: random access
• wide-area R/A for files is new
• typically addressed by adding GSI to existing cluster protocols
  • dcap -> GSI-dcap
  • rfio -> GSI-RFIO
  • xrootd -> ??
• One (new) OGSA-style service: WS-ByteIO
  • Bulk interface
  • RandomIO interface
  • posix-like
  • needs negotiation of actual transfer protocol: attachment, DIME, …
SE transfer: local back-end access
• backend of a grid store is not always just a disk
  • distributed storage systems without native posix
  • even if posix emulation is provided, that is always slower!
• for grid use, need to also provide GridFTP and a management interface: SRM
• local access might be through the native protocol
  • but the application may not know
  • and it is usually not secure enough to run over WAN
  • so no use for ‘non-LAN’ use by others in the grid
Storage Management (SRM)
common management interface on top of many backend storage solutions
a GGF draft standard (from the GSM-WG)
Standards for Storage Resource Management
Main concepts
• Allocate spaces
• Get/put files from/into spaces
• Pin files for a lifetime
• Release files and spaces
• Get files into spaces from remote sites
• Manage directory structures in spaces
• SRMs communicate with other SRMs as peers
• Negotiate transfer protocols
• No logical name space management (can come from GGF-GFS)
source: A. Sim, CRD, LBNL 2005
SRM Functional Concepts
• Manage spaces dynamically
  • Reservation, allocation, lifetime
  • Release, compact
  • Negotiation
• Manage files in spaces
  • Request to put files in spaces
  • Request to get files from spaces
  • Lifetime, pinning of files, release of files
  • No logical name space management (rely on GFS)
• Access remote sites for files
  • Bring files from other sites and SRMs as requested
  • Use existing transport services (GridFTP, http, https, ftp, bbftp, …)
  • Transfer protocol negotiation
• Manage multi-file requests
  • Manage request queues
  • Manage caches, pre-caching (staging) when possible
  • Manage garbage collection
• Directory management
  • Manage directory structure in spaces
  • Unix semantics: srmLs, srmMkdir, srmMv, srmRm, srmRmdir
• Possible Grid access to/from MSS
  • HPSS, MSS, Enstore, JasMINE, Castor
source: A. Sim, CRD, LBNL 2005
SRM Methods by the features
• Core (Basic): srmChangeFileStorageType, srmExtendFileLifetime, srmGetFeatures, srmGetRequestSummary, srmGetRequestToken, srmGetSRMStorageInfo, srmGetSURLMetaData, srmGetTransferProtocols, srmPrepareToGet, srmPrepareToPut, srmPutFileDone, srmPutRequestDone, srmReleaseFiles, srmStatusOfGetRequest, srmStatusOfPutRequest, srmTerminateRequest
• Space management: srmCompactSpace, srmGetSpaceMetaData, srmGetSpaceToken, srmReleaseFilesFromSpace, srmReleaseSpace, srmReserveSpace, srmUpdateSpace
• Authorization Functions: srmCheckPermission, srmGetStatusOfReassignment, srmReassignToUser, srmSetPermission
• Request Administration: srmAbortRequestedFiles, srmRemoveRequestedFiles, srmResumeRequest, srmSuspendRequest
• Copy Function: srmCopy, srmStatusOfCopyRequest
• Directory Function: srmCp, srmLs, srmMkdir, srmMv, srmRm, srmRmdir, srmStatusOfCpRequest, srmStatusOfLsRequest
source: A. Sim, CRD, LBNL 2005
SRM interactions
[Figures: six slides show the same components: Client, SRM Daemon, DPM Daemon, DPM Database, DPNS Daemon, and three Data Servers each running a GridFTP daemon. The numbered steps across the slides are:]
1a. SRM Put
1b. Put into Request Database
1c. Return SRM RequestId
2a. Get Request from Database
2b. Check permissions and add to NS
2c. Pick best Data Server to put data onto
2d. Add TURL in Request Database and mark ‘Ready’
2e. Add to replica table and set status ‘Pending’
3a. SRM getRequestStatus
3b. Get TURL from Request
3c. Return TURL
4a. SRM(v1) set ‘Running’
4b. Update status of request
5. Put file via GridFTP
6a. SRM(v1) set Done
6b. Notify ‘Done’
6c. Get filesize
6d. Update replica metadata (size/status/pintime)
6e. Update status of request
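The numbered put sequence on these slides can be walked through with a toy in-memory model. Everything here is an assumption made for illustration: the class `ToyDPM`, its method names, the initial request status `"Submitted"`, the final replica status `"Available"`, and the rule that the "best" data server is the least loaded one. Only the statuses ‘Ready’, ‘Pending’, ‘Running’ and ‘Done’ come from the slides, and step 5 (the GridFTP transfer itself) is not modelled.

```python
class ToyDPM:
    """In-memory stand-in for the SRM/DPM daemons and their database."""

    def __init__(self, data_servers):
        self.data_servers = dict(data_servers)  # server name -> current load
        self.requests = {}                      # request id -> request record
        self.replicas = {}                      # SURL -> replica record
        self._next_id = 0

    def srm_put(self, surl):
        # Steps 1a-1c: store the request and hand back a request id.
        self._next_id += 1
        self.requests[self._next_id] = {
            "surl": surl, "status": "Submitted", "turl": None}
        return self._next_id

    def process(self, request_id):
        # Steps 2a-2e: pick a server, record the TURL, mark the request Ready.
        req = self.requests[request_id]
        server = min(self.data_servers, key=self.data_servers.get)
        self.data_servers[server] += 1
        req["turl"] = f"gsiftp://{server}/{request_id}"
        req["status"] = "Ready"
        self.replicas[req["surl"]] = {"status": "Pending", "size": None}

    def get_request_status(self, request_id):
        # Steps 3a-3c: the client polls and receives the TURL once Ready.
        req = self.requests[request_id]
        return req["status"], req["turl"]

    def set_running(self, request_id):
        # Steps 4a-4b.
        self.requests[request_id]["status"] = "Running"

    def set_done(self, request_id, size):
        # Steps 6a-6e: close the request and update the replica metadata.
        req = self.requests[request_id]
        req["status"] = "Done"
        self.replicas[req["surl"]] = {"status": "Available", "size": size}
```

The point of the exercise is that the client never talks to a data server until the SRM has returned a TURL, and that the request record and the replica record are updated separately.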
Storage infra example with SRM
graphic: Mark van de Sanden, SARA
SRM Summary
• SRM is a functional definition
  • Adaptable to different frameworks for operation (WS, WSRF, …)
• Multiple implementations interoperate
  • Permit special purpose implementations for unique products
  • Permits interchanging one SRM product by another
• SRM implementations exist and some in production use
  • Particle Physics Data Grid
  • Earth System Grid
  • More coming …
• Cumulative experiences
  • SRM v3.0 specifications to complete
source: A. Sim, CRD, LBNL 2005
Replicating Data
• Data on the grid may, will and should exist in multiple copies
• Replicas may be temporary
  • for the duration of the job
  • opportunistically stored on cheap but unreliable storage
  • contain output cached near a compute site for later scheduled replication
• Replicas may also provide redundancy
  • application level instead of site-local RAID or backup
Replication issues
• Replicas are difficult to manage if the data is modifiable and consistency is required
• Grid DM today does not address modifiable data sets as soon as more than one copy of the data exists; otherwise, the result would be either
  • inconsistency, or
  • a need for close coordination between storage locations (slow), or
  • an almost guaranteed deadlock
• Some wide-area distributed file systems do this (AFS, DFS), but are not scalable or require a highly available network
Grid Storage concepts: Catalogues
• Catalogues
  • index of files that link to a single object (referenced by GUID)
  • Catalogues are logically a VO function, with local instances per site
• Capabilities
  • expose mappings, not actual data
  • File or Meta-data Catalogue: names, metadata -> GUID
  • Replica Catalogue and Index: GUID -> SURLs for all SEs containing the file
File Catalogues
graphic: Peter Kunszt, EGEE DJRA1.4 gLite Architecture
Alternatives to the File Catalogue
• Store SURLs with data in application DB schema
  • better adapted to the application needs
  • easier integration in existing frameworks
Grid Storage Concepts: Transfer Service
• Transfer service
  • responsible for moving (replicating) data between SEs
  • transfers are scheduled, as data movement capacity is scarce (not because of WAN network bandwidth, but because of CPU capacity and disk/tape bandwidth in data movement nodes!)
  • logically a per VO function, hosted at the site
  • builds on top of the SE abstraction and a data movement protocol, and is co-ordinated with a specific SE
• Capabilities
  • transfer SURL at SE1 to new SURL at SE2, using SE mechanisms such as SRM-COPY, or directly GridFTP
  • either push or pull
  • subject to a set of policies, e.g. max. number of simultaneous transfers between SE1 and SE2, with specific timeout or #retries
  • asynchronous
  • states like: SUBMITTED, PENDING, ACTIVE, CANCELLING, CANCELLED, DONE_INCOMPLETE, DONE_COMPLETE
  • update replica catalogues (GUID->SURL mappings)
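A scheduled transfer service with a cap on simultaneous transfers can be sketched as a small state machine. The state names are the ones listed on the slide; the allowed transitions, the `Channel` class and its methods are assumptions for illustration and not the FTS implementation.

```python
from collections import deque

# Assumed transition table: the slide lists the states but not the edges.
ALLOWED = {
    "SUBMITTED": {"PENDING", "CANCELLING"},
    "PENDING": {"ACTIVE", "CANCELLING"},
    "ACTIVE": {"DONE_COMPLETE", "DONE_INCOMPLETE", "CANCELLING"},
    "CANCELLING": {"CANCELLED"},
}


class Channel:
    """Toy transfer queue between two SEs with a concurrency policy."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.state = {}       # job name -> current state
        self.queue = deque()  # jobs waiting to become ACTIVE

    def _move(self, job, new_state):
        if new_state not in ALLOWED.get(self.state[job], set()):
            raise ValueError(f"{job}: {self.state[job]} -> {new_state} not allowed")
        self.state[job] = new_state

    def submit(self, job):
        self.state[job] = "SUBMITTED"
        self._move(job, "PENDING")
        self.queue.append(job)

    def schedule(self):
        """Start queued jobs until the max_active policy is reached."""
        active = sum(1 for s in self.state.values() if s == "ACTIVE")
        started = []
        while self.queue and active < self.max_active:
            job = self.queue.popleft()
            if self.state[job] != "PENDING":
                continue  # cancelled while waiting
            self._move(job, "ACTIVE")
            active += 1
            started.append(job)
        return started

    def finish(self, job, ok=True):
        self._move(job, "DONE_COMPLETE" if ok else "DONE_INCOMPLETE")
```

Because submission only queues a job and `schedule` decides when it runs, the caller has to poll the state, which is what makes the service asynchronous.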
File Transfer Service
graphic: gLite Architecture v1.0 (EGEE-I DJRA1.1)
FTS ‘Channels’
Scheduled number of transfers from one site to a (set of) other sites
below: CERNCI to sites on the OPN (next slide)
FTS channels
for scaling reasons one transfer agent for each channel, i.e. each SRC<->TGT pair agents can be spread over multiple boxes
LHC OPN
in network terms
Cricket graph 2006 CERN->SARA via OPN link speed is 10 Gb/s
FTS complex services
• Protocol translation
  • although many will, not all SEs support GridFTP
  • FTS in that case needs protocol translation
  • translation through memory excludes third-party transfers
• Other issues
  • credential handling
  • files on the source and target SE are readable for specific users and specific VO (groups)
  • SEs are site services, and sites want to be accessed with the end-user credential for traceability (not a generic “VO” account)
  • continued access to the user credential is needed (like in any compute broker)
Grid Storage Concept: File Placement
• Placement Service
  • manage transfers for which the host site is the destination
  • coordinate updates to the VO file catalogue and the actual transfers (via the FTS, a site-managed service)
• Capabilities
  • transfer GUID or LFN from A to B (note: the FTS could only operate on SURLs)
  • needs access to the VO catalogues, and thus needs sufficient privileges to do the job (i.e. update the catalogues)
  • API can be the same as for the FTS
Data Scheduler
Like the placement service, but can direct requests to different sites
DM: Putting it all together
graphic: gLite Architecture v1.0 (EGEE-I DJRA1.1)
GT4 view on the same issues
• Similar functionality, but more closely linked to the VO than the site
• based on soft-state registrations (like the information system)
• treats files as the basic resource abstraction
next two slides: Ann Chervenak, ISI/USC: Overview of GT4 Data Management Services, 2004
RLS Framework
• Local Replica Catalogs (LRCs) contain consistent information about logical-to-target mappings
• Replica Location Index (RLI) nodes aggregate information about one or more LRCs
• LRCs use soft state update mechanisms to inform RLIs about their state: relaxed consistency of index
• Optional compression of state updates reduces communication, CPU and storage overheads
• Membership service registers participating LRCs and RLIs and deals with changes in membership
[Figure: Replica Location Indexes (RLI nodes) drawn above Local Replica Catalogs (LRC nodes)]
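The soft-state idea behind the LRC-to-RLI updates can be illustrated with entries that expire unless refreshed. This is a sketch under assumptions: the class `ReplicaLocationIndex`, its methods and the fixed time-to-live are hypothetical and are not the Globus RLS API; time is passed in explicitly so the behaviour is easy to follow.

```python
class ReplicaLocationIndex:
    """Toy RLI: remembers which LRC knows a logical name, until the entry expires."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._entries = {}  # (logical name, lrc) -> expiry time

    def soft_state_update(self, lrc, logical_names, now):
        # An LRC periodically re-announces its names; each update renews the expiry.
        for name in logical_names:
            self._entries[(name, lrc)] = now + self.ttl_s

    def lookup(self, logical_name, now):
        # Only entries that have not expired are returned, so the index may lag
        # behind the LRCs (relaxed consistency) but cleans itself up.
        return sorted(
            lrc for (name, lrc), expiry in self._entries.items()
            if name == logical_name and expiry > now
        )
```

An LRC that disappears simply stops sending updates, and its entries drop out of the index after one time-to-live without any explicit de-registration.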
Replica Location Service In Context
• The Replica Location Service is one component in a layered data management architecture
• Provides a simple, distributed registry of mappings
• Consistency management provided by higher-level services
[Figure: layered components: Replica Location Service, Reliable Data Transfer Service, GridFTP, Reliable Replication Service, Replica Consistency Management Services, Metadata Service]
Access Control Lists
• Catalogue level
  • protects access to meta-data
  • is only advisory for actual file access, unless the storage system only accepts connections from a trusted agent that itself does a catalogue lookup
• SE level
  • either natively (i.e. supported by both the SRM and transfer services)
  • or via an agent-system like gLiteIO
• SRM/transfer level
  • SRM and GridFTP server need to look up access rights in a local ACL store for each transfer
  • need “all files owned by SRM” unless underlying FS supports ACLs
• OS level
  • native POSIX-ACL support in OS needed
  • only available for limited number of systems (mainly disk based)
  • not (yet) in popular HSM solutions
Grid ACL considerations
• Semantics
  • Posix semantics require that you traverse up the tree to find all constraints
    • behaviour both costly and possibly undefined in a distributed context
  • VMS and NTFS container semantics are self-contained
    • taken as a basis for the ACL semantics in many grid services
• ACL syntax & local semantics typically Posix-style
Catalogue ACL method in GT4 with WS-RF
[Figure: a client request passes through the GT4 Authorization Framework and a Custom PDP to the Policy Engine and the LRC]
(1) Client Request
(2) Custom auth callout (includes client request)
(3) Request PIDs for logical names
(4) PIDs
(5) Pass policy ID, subject, object, action
(6) Query policies for PIDs
(7) permit or deny
(8) permit or deny
(9) If permitted, pass client request to LRC
Policy Database
  LFN1 -> PIDA
  LFN2 -> PIDB
  PIDA: group1: read; group2: all; group 3: none; user7: read
  PIDB: group1: read, write; group2: all; group 3: all
graphic: Ann Chervenak, ISI/USC, from presentation to the Design Team, Argonne, 2005
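The policy lookup in the figure can be mimicked with two dictionaries holding the example data from the slide. The decision rule is an assumption made for this sketch: a request is permitted if any of the caller's identities grants the action (or `all`), unknown logical names are denied, and `none` is modelled as an empty set of rights. The function name `decide` is hypothetical and is not part of GT4.

```python
# Example data from the slide: logical name -> policy ID -> rights per subject.
LFN_TO_PID = {"LFN1": "PIDA", "LFN2": "PIDB"}
POLICIES = {
    "PIDA": {"group1": {"read"}, "group2": {"all"}, "group3": set(), "user7": {"read"}},
    "PIDB": {"group1": {"read", "write"}, "group2": {"all"}, "group3": {"all"}},
}


def decide(subjects, logical_name, action):
    """Return 'permit' or 'deny' for a caller holding the given identities."""
    pid = LFN_TO_PID.get(logical_name)          # steps (3)-(4): name -> PID
    if pid is None:
        return "deny"                           # assumed default: deny unknown names
    policy = POLICIES[pid]                      # step (6): query policies for the PID
    for subject in subjects:
        rights = policy.get(subject, set())
        if "all" in rights or action in rights:
            return "permit"                     # steps (7)-(8)
    return "deny"
```

Only when the answer is `permit` would the request be passed on to the LRC, as in step (9).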
Stand-alone solutions
SRB
the SDSC Storage Resource Broker
SRB Data Management Objectives
• Automate all aspects of data management
  • Discovery (without knowing the file name)
  • Access (without knowing its location)
  • Retrieval (using your preferred API)
  • Control (without having a personal account at the remote storage system)
  • Performance (use latency management mechanisms to minimize impact of wide-area-networks)
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Federated SRB server model
[Figure: an Application addresses data by logical name or attribute condition; numbered steps 1-6 connect it to SRB agents, SRB servers and the MCAT, ending at resources R1 and R2 (steps 5/6). Labels: peer-to-peer brokering; server(s) spawning; parallel data access]
MCAT:
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Features
• Authentication: encrypted password, GSI, certificate based
• Metadata has it all
• storage in a (definable) flat file system
• Data put into Collections (unix directories), access and control operation possible
• parallel transport of files
• Physical Resources combine to Logical Resource
• Encrypted data and/or encrypted metadata
• Free-ish (educational)
• commercial version of an old SRB at http://www.nirvanastorage.com
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: layered SRB architecture, with these labels]
• Application
• Access APIs: C, C++ libraries; Unix shell; Linux I/O; DLL / Python; Java, NT browsers; OAI, WSDL; GridFTP
• Consistency Management / Authorization-Authentication
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction: databases (DB2, Oracle, Postgres, SQLServer, Informix)
• Storage Abstraction: archives (HPSS, ADSM, UniTree, DMF); databases (DB2, Oracle, Postgres); file systems (Unix, NT, Mac OSX); HRM; ORB
• Servers; Prime Server
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Production Data Grid
• SDSC Storage Resource Broker
  • Federated client-server system, managing
    • Over 70 TBs of data at SDSC
    • Over 10 million files
  • Manages data collections stored in
    • Archives (HPSS, UniTree, ADSM, DMF)
    • Hierarchical Resource Managers
    • Tapes, tape robots
    • File systems (Unix, Linux, Mac OS X, Windows)
    • FTP sites
    • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix)
    • Virtual Object Ring Buffers
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Mappings on Name Space
• Define logical resource name
  • List of physical resources
• Replication
  • Write to logical resource completes when all physical resources have a copy
• Load balancing
  • Write to a logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
  • Write to a logical resource completes when copies exist on “k” of “n” physical resources
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
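The three write-completion rules for a logical resource can be written as one predicate. The function name, the policy strings and the `next_index` argument are hypothetical; the sketch only encodes the conditions stated on the slide and is not SRB code.

```python
def write_complete(policy, copies_done, resources, k=None, next_index=0):
    """Decide whether a write to a logical resource counts as complete.

    copies_done: set of physical resource names that already hold a copy.
    resources:   ordered list of physical resources behind the logical resource.
    """
    if policy == "replication":
        # Every physical resource must hold a copy.
        return all(r in copies_done for r in resources)
    if policy == "load_balancing":
        # Only the next resource in the list must hold a copy.
        return resources[next_index % len(resources)] in copies_done
    if policy == "fault_tolerance":
        # At least k of the n resources must hold a copy.
        if k is None:
            raise ValueError("fault_tolerance needs k")
        return sum(r in copies_done for r in resources) >= k
    raise ValueError(f"unknown policy: {policy}")
```

For a logical resource backed by `["disk1", "disk2", "tape1"]`, a single copy on `disk2` completes a load-balanced write when `next_index` is 1, but not a replicated one.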
SRB Development
• Now at version 3.4 (as of November 2005)
• Peer-to-peer federation of ZONES
  • Support multiple independent MCAT catalogs
  • Replicate metadata
• mySQL/BerkeleyDB port
• OGSA/OGSI compliant interface
• GridFTP interfaces
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
User Interfaces
• Unix command line tools: S-commands (e.g. Sls, Spwd, Sget, Sput)
• Windows SRB browser: InQ
• Web interface: mySRB
• Java and C API
• Java admin tools
DEMO
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Administrative Interface
Also available as Unix command
java based admin tool
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Unix Command-line Tool S*
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Windows Browser InQ
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Web Interface
source: Maurice Bouwhuis, SARA, based on data by Reagan Moore, SDSC
Nice and Not so Nice
+ It works and is being used in “production”
+ metadata based
+ it knows GSI and will know gridFTP
- for S-commands, password in plain text in file (should not be necessary)
- InQ does not know GSI
- Not all interfaces have same capabilities
source: Maurice Bouwhuis, SARA
Structured Data
OGSA-DAI
Access to structured data
• Several layers
  • access layer
    • do not virtualise schema and semantics, ‘just get there’
    • OGSA-DAI, Spitfire (deprecated)
  • semantic layer
    • interpret and attempt to merge schemas using ontology discovery
    • a research topic today, with some interesting results
    • see e.g. the April VL-e workshop for some nice examples
OGSA-DAI
• An extensible framework for data access and integration.
• Expose heterogeneous data resources to a grid through web services.
• Interact with data resources:
  • Queries and updates.
  • Data transformation / compression
  • Data delivery.
• Customise for your project using
  • Additional Activities
  • Client Toolkit APIs
  • Data Resource handlers
• A base for higher-level services
  • federation, mining, visualisation, …
http://www.ogsadai.org.uk/
source: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006
Considerations
• Efficient client-server communication
  • One request specifies multiple operations
• No unnecessary data movement
  • Move computation to the data
  • Utilise third-party delivery
  • Apply transforms (e.g., compression)
• Build on existing standards
  • Fill in gaps where necessary: specifications from DAIS WG
• Do not hide underlying data model
  • Users must know where to target queries
  • Data virtualisation is hard
• Extensible architecture
  • Extensible activity framework
  • Cannot anticipate all desired functionality
  • Allow users to plug in their own
based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006
OGSA-DAI services
OGSA-DAI uses data services to represent and provide access to a number of data resources
[Figure: a Data Service linked to three Data Resources, with arrows labelled ‘accesses’ and ‘represents’]
based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006
Services
Services co-located with the data as much as possible
[Figure: Applications use the Client Toolkit to reach an OGSA-DAI service. Labels inside the service: Engine; Activities (SQLQuery, XPath, readFile, GZip, GridFTP, ToCSV); Data Service Resources (JDBC, XMLDB, File). Back ends shown: databases (MySQL, DB2, SQLServer), eXist, SWISSPROT]
based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006
Supported data sources
Relational     XML        Files
MySQL          eXist      Text Files
DB2            Xindice    Binary Files
Oracle 10                 CSV
SQLServer                 SwissProt
PostgreSQL                OMIM
based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006
Service interaction
[Figure: a Client sends an XML <perform>…</perform> document to the Data Service, which runs a chain of Activities and returns an XML <response>…</response> document; binary result data (…011010011101100…) goes to a Data Sink]
based on: Amy Krause, EPCC Edinburgh: OGSA-DAI Overview, GGF17, Tokyo, 2006
Data Service internals
from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005
Request/response
<perform xmlns="…" xmlns:xsi="…" xsi:schemaLocation="…">
  <sqlQueryStatement name="statement">
    <expression>
      select * from littleblackbook where id=10
    </expression>
    <resultSetStream name="output"/>
  </sqlQueryStatement>
  <deliverToURL name="deliverOutput">
    <fromLocal from="output"/>
    <toURL>ftp://anon:frog@ftp.example.com/home</toURL>
  </deliverToURL>
</perform>

<gridDataServiceResponse xmlns="…">
  <result name="deliverOutput" status="COMPLETED"/>
  <result name="statement" status="COMPLETED"/>
</gridDataServiceResponse>
from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005
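A perform document with the same shape as the request on this slide can be assembled with Python's standard `xml.etree.ElementTree`. This is a sketch only: the helper name `build_perform` is hypothetical, the namespace attributes are left out because the slide elides them, and nothing is sent to a service.

```python
import xml.etree.ElementTree as ET


def build_perform(sql, to_url):
    """Build a <perform> request: one SQL query whose output is delivered to a URL."""
    perform = ET.Element("perform")

    # Activity 1: run the query and expose its result as the stream "output".
    statement = ET.SubElement(perform, "sqlQueryStatement", name="statement")
    ET.SubElement(statement, "expression").text = sql
    ET.SubElement(statement, "resultSetStream", name="output")

    # Activity 2: read the stream "output" and deliver it to the given URL.
    deliver = ET.SubElement(perform, "deliverToURL", name="deliverOutput")
    ET.SubElement(deliver, "fromLocal", {"from": "output"})  # 'from' is a Python keyword
    ET.SubElement(deliver, "toURL").text = to_url

    return ET.tostring(perform, encoding="unicode")


request_xml = build_perform(
    "select * from littleblackbook where id=10",
    "ftp://anon:frog@ftp.example.com/home",
)
```

The two activities are connected only by the shared stream name `output`, which is how one request can specify several operations.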
Client library interaction
• SQLQuery
    SQLQuery query = new SQLQuery("select * from littleblackbook where id='3475'");
• XPathQuery
    XPathQuery query = new XPathQuery("/entry[@id<10]");
• XSLTransform
    XSLTransform transform = new XSLTransform();
• DeliverToGFTP
    DeliverToGFTP deliver = new DeliverToGFTP("ogsadai.org.uk", 8080, "myresults.txt");
you have to know the backend structure of the data source
from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005
Simple requests
• Simple requests consist of only one activity
• Send the activity directly to the perform method
    SQLQuery query = new SQLQuery("select * from littleblackbook where id='3475'");
    Response response = service.perform(query);
from: Alexander Wöhrer, AustrianGrid OGSA-DAI tutorial, GGF13 Seoul, 2005
Closing Remarks
Miscellaneous tidbits
• Data Curation
  • the need to preserve data over time
  • migrating media (preserve readability) is only one aspect
  • need also format conversion, or emulation of the programs operating on the data
• Data Provenance
  • need to know how this data has come into being
  • association of meta-data and work flow
  • recording of workflow and w/f instances is essential
  • this is (today) application specific, but maybe, one day, …