INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
Summary of the data access session
EGEE User Forum, March 3rd, 2006
Johan Montagnat, Birger Koblitz
Data access parallel session
• ~60 persons attending (peak activity)
• Talks were grouped in 3 different panels
– Metadata and databases access
– File access
– Applications
• 3 associated demonstrations
Agenda
• Panel on metadata and databases access
• GDSE: data source oriented computing element
– Dr. Giuliano Taffoni, INFN, CNAF
• ATLAS metadata interface
– Thomas Doherty, University of Glasgow
• The AMGA metadata service
– Dr. Birger Koblitz, CERN
• Oracle on the grid
– Bjorn Engsing, Oracle
DSE: Data Source Engine
• We define a new Grid component (G-DSE) that enables the access to a Data Source Engine and Data Source, totally integrated with the Grid Monitoring and Discovery System and Resource Broker.
• The new Grid Element, the Query Element, can then be built on top of the G-DSE component.
• It handles very long SQL queries just as a CE would handle jobs.
GDSE integration
[Figure: G-DSE integration diagram. Labels: gatekeeper; JobManager and QueryManager; JobProcess and QueryProcess; scheduler plug-in (PBS/LSF); query plug-in; query DB-specific driver; GRAM; GIS; RDBMS; MDS; GRIS; LDAP/LDIF; Grid Providers (SNMP).]
Features
• Data source indexing, monitoring, management and recovery
• GRAM or WS protocol
• Transactions/queries specified through RSL/JDL
• The grid WMS is used to support the execution
• The grid IS is used to monitor the transactions
• GSI and VOMS based access control
– Different roles (administrator, writer, selecter)
– Access control at table and row level
• Connects to different RDBMS
• Supports workflows of query jobs with inter-dependencies
• Support for replication
• Application to AstroDBs
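A workflow of query jobs with inter-dependencies can be pictured as a dependency graph that is executed in topological order. The sketch below is illustrative only: the job names and the `run_order` helper are hypothetical and are not part of G-DSE. It relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical query jobs: each key depends on the jobs listed in its value set.
dependencies = {
    "select_sources": set(),
    "cross_match": {"select_sources"},
    "compute_statistics": {"cross_match"},
    "export_results": {"cross_match", "compute_statistics"},
}

def run_order(deps):
    """Return one valid execution order; graphlib raises CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)
```

A job is only listed after every job it depends on, which is the property a workflow engine needs before it submits each query.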
ATLAS Metadata Interface
• It is an application under development that stores, and gives access to, dataset metadata for the ATLAS experiment
• It fulfils the need of many database-backed applications by offering a generic web service and servlet interface, through the use of self-describing databases
• It supports geographical distribution through web services and secure access through grid certificates
Adaptation of AMI Architecture for gLite Interfaces
[Figure: adaptation of the AMI architecture for gLite interfaces. Labels: Web Service client; gLite Interface method; gLite Interface implementation; controller class; result returned in XML format.]
Features
• Supports Oracle, MySQL and SQLite DBs
• gLite metadata interface
• Web Service interface (AXIS container in Tomcat)
• Authentication: based on certificate DN
• Very fine grain authorization
– Roles
– At project or records level
– May write ad hoc control classes
• Secured and well defined interface for providing access to metadata
AMGA: ARDA metadata interface
• gLite 1.5 metadata catalog
• Two modes
– With the LFC: bind metadata to files
– Standalone: general relational data
• Front ends
– Web Service
– Proprietary TCP streaming protocol
• Implementing the gLite metadata interface
• Versatile, provides both performance and security
• Security components (optional)
– SSL connections
– Password/X509 certificates/proxies based authentication
– Posix-ACLs and Unix permissions at table and row level
• Applications: LHCb, Medical Data Management, gLibrary, UnoSat...
Performances
• Comparison with LFC and FireMan catalogs
Replication & Federation modes
Oracle
• Free Oracle software
– Express edition, limited to 1 CPU
• Support for Linux on many distributions
• Provides streams for replication
Discussion
• Standards
– How does gLite commit to standards?
– A lot of GGF work invested in defining standards
– Difficult to endorse standards as they are evolving and the global picture is not so clear today
• Security
– Common concern, different granularities
• Replication
– Partially implemented in existing databases, different semantics
– Should this be implemented at a higher level?
• Distribution
– Some work on information schemas for locating metadata
– What about queries on a priori unlocated data?
• Grids of databases
– Let the grid pick the “best” database for you
• There is room for more research activity!
Agenda
• Panel on file access
• gLite File Transfer Service
– Paolo Badino, CERN
• Encrypted Data Storage in EGEE
– Akos Frohner, CERN
• Storage Resource Manager Interface
– Maarten Litmaath, CERN
File Transfer Service channels
• Logical unit of management
– Represents a directed network pipe between two sites
• Mono-directional
• Independently manageable
– State
– Number of streams
– Number of concurrent transfers
• Inter-VO scheduling
– VO share
• No routing
• Between specific host pairs or groups of hosts
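The channel properties listed above can be pictured as a small data structure. This is a sketch under stated assumptions, not the FTS data model: the `Channel` class, its field names, the "Active" state value, the site names and the proportional slot split are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Channel:
    """A directed pipe from one site to another (all names here are illustrative)."""
    source: str
    destination: str
    state: str = "Active"          # assumed state label
    streams: int = 1               # number of streams per transfer
    max_concurrent: int = 10       # number of concurrent transfers
    vo_shares: dict = field(default_factory=dict)  # VO name -> relative share

    def slots_per_vo(self):
        """Split the concurrent-transfer slots among VOs in proportion to their shares."""
        total = sum(self.vo_shares.values())
        if total == 0:
            return {}
        return {vo: self.max_concurrent * share // total
                for vo, share in self.vo_shares.items()}

# Hypothetical channel; the reverse direction would need a second Channel object.
channel = Channel("SITE-A", "SITE-B", streams=5, max_concurrent=10,
                  vo_shares={"atlas": 50, "cms": 30, "lhcb": 20})
slots = channel.slots_per_vo()
print(slots)
```

Because a channel is mono-directional and independently manageable, each direction between two sites carries its own state, stream count and shares.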
Transfer Jobs and Files
• Job
– Represents the transfer request
– Identified by a GUID
• File
– A pair of source and destination file names
[Figure: state diagrams for job states and file states.]
What SC achieved so far
• SC3 Rerun (January 2006)
• All sites achieved target rate
• 8/11 sites achieved nominal rate
Encryption/Decryption System
• Designed to fulfill biomedical application needs
– Fine grain access control
– Data encryption
– Anonymity
• Based on gLiteIO, FiReMan and an SRM v1.1
• Access control through gLiteIO
Encryption
• Anonymity: patient data separated from files (stored in AMGA)
• ACL access control on files (FiReMan)
• File keys distributed among Hydra servers with ACL
And decryption
• Key retrieved from the Hydra key server
• Data decrypted block by block in memory (OpenSSL ciphers)
• Encryption also works for output data
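The block-by-block, in-memory pattern can be sketched as follows. This is only an illustration of the streaming structure: the toy keystream built from SHA-256 is a stand-in chosen so the example runs with the standard library alone. It is NOT an OpenSSL cipher and must not be used to protect real data. The key value, block size and function names are assumptions.

```python
import hashlib
import io

BLOCK_SIZE = 4096  # bytes handled per step (arbitrary choice)

def keystream_block(key: bytes, index: int, length: int) -> bytes:
    """Toy keystream for block `index` (SHA-256 in counter mode). NOT a real cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(
            key + index.to_bytes(8, "big") + counter.to_bytes(8, "big")
        ).digest()
        counter += 1
    return out[:length]

def transform(stream, key: bytes):
    """Yield transformed blocks one at a time; XOR makes encryption and decryption identical."""
    index = 0
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        pad = keystream_block(key, index, len(block))
        yield bytes(a ^ b for a, b in zip(block, pad))
        index += 1

key = b"placeholder-key"  # in the real system the key would come from the key server
plaintext = b"patient image bytes " * 1000
ciphertext = b"".join(transform(io.BytesIO(plaintext), key))
recovered = b"".join(transform(io.BytesIO(ciphertext), key))
print(recovered == plaintext)
```

Only one block is held and transformed at a time, so the clear data never has to be written to disk, and the same loop serves for encrypting output data.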
What is the SRM?
• Client-server interface for Storage Resource Management
– De facto standard (see further on), GGF working group http://sdm.lbl.gov/srm-wg/
– Secure web service
– Defines functions that allow storage resources to be managed from both client and server perspectives (different requirements, optimizations, concerns)
• SRM collaboration institutes develop different implementations
– CERN + RAL + INFN (CASTOR-2)
– CERN/LCG (DPM)
– FNAL + DESY (dCache)
– JLAB (J-SRM)
– LBNL (DRM, HRM)
– EGRID/INFN/GridIt (StoRM)
Is the SRM a standard?
• “The nice thing about standards is that there are so many to choose from.” - Andrew S. Tanenbaum
• Version 1.1 in widespread use
– But implementations have subtle incompatibilities due to ambiguities in the “standard”
– Various basic functionalities not defined
• Version 2.1 implemented to various extents by some projects
– Try to get a critical subset implemented on WLCG by autumn 2006 (use cases defined by LHC experiments, see next pages)
– Still lacks some features
– Incompatible with version 1: clients and servers need to support both versions during the transition period (may last a long time)
• Version 3 definition many months away
– Again incompatible
What should the SRM do? (A. Shoshani, PPDG Review, 28 Apr 2003)
• Manage space dynamically
– Any disk caches and Mass Storage Systems
– Space reservation and negotiation
– Manage “lifetime” of spaces
• Manage files dynamically
– Pin files in storage till they are released
– Manage “lifetime” of files, and action when lifetime expires
• Manage file sharing
– Policies on what to evict when space is needed (currently always decided by back-end)
• Manage multi-file requests
– A brokering function: queue file requests, pre-stage files
– Invoke file transfer services
• Permit site-SRM over multiple storage systems
• Negotiate transfer protocols
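The ideas of pinning files, file lifetimes and eviction when space is needed can be illustrated with a toy model. The `PinManager` class and its method names below are hypothetical and do not reproduce the SRM interface; the clock is injectable so the behaviour can be shown without waiting.

```python
import time

class PinManager:
    """Tracks pinned files and their expiry times (a toy model, not the SRM interface)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._pins = {}  # file name -> expiry timestamp

    def pin(self, name, lifetime):
        """Pin a file for `lifetime` seconds from now."""
        self._pins[name] = self._clock() + lifetime

    def release(self, name):
        """Release a pin before its lifetime expires."""
        self._pins.pop(name, None)

    def is_pinned(self, name):
        expiry = self._pins.get(name)
        return expiry is not None and expiry > self._clock()

    def evictable(self, names):
        """Files that may be evicted when space is needed: those not currently pinned."""
        return [n for n in names if not self.is_pinned(n)]

now = [0.0]                      # fake clock, advanced by hand
pins = PinManager(clock=lambda: now[0])
pins.pin("run1.dat", lifetime=60)
pins.pin("run2.dat", lifetime=10)
now[0] = 30.0                    # 30 seconds later, the second pin has expired
candidates = pins.evictable(["run1.dat", "run2.dat", "run3.dat"])
print(candidates)
```

Which of the evictable files is actually removed is a policy decision, which the slide notes is currently always left to the back-end.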
Discussion
• Connection between Data Management and job scheduling
– The file catalog holds information on file locations, used for scheduling
– Jobs are scheduled where data sits
– In some cases, data could move to where resources are available for computations
– Is this desirable?
• Legacy code is common in scientific applications
– Transparent POSIX access
• Data encryption
– Transparency
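The first discussion point, scheduling jobs where the data sits, can be sketched as a lookup in the file catalog. The `choose_site` function, the catalog layout, the site names and the tie-breaking rule are assumptions made for this illustration; they do not describe the gLite workload management system.

```python
def choose_site(job_files, catalog, free_slots):
    """Pick the site with free slots that already holds most of the job's input files."""
    candidates = [site for site, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None

    def local_files(site):
        # Count the input files that have a replica at this site.
        return sum(1 for f in job_files if site in catalog.get(f, ()))

    # Prefer data locality first, then the number of free slots.
    return max(candidates, key=lambda s: (local_files(s), free_slots[s]))

# Hypothetical catalog: logical file name -> sites holding a replica.
catalog = {"lfn:/a": {"site1", "site2"}, "lfn:/b": {"site2"}}
inputs = ["lfn:/a", "lfn:/b"]

first = choose_site(inputs, catalog, {"site1": 4, "site2": 1, "site3": 10})
second = choose_site(inputs, catalog, {"site1": 4, "site2": 0, "site3": 10})
print(first, second)
```

In the second call the best-placed site is full, so the job goes to a site holding only part of the data, and the missing file would have to be moved there. That is the trade-off the slide asks about.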
Agenda
• Panel on applications
• Space Physics Interactive Data Resource – SPIDR
– Dr. Zhinzhin, Russian Academy of Science
• gLibrary: a multimedia contents manager system
– Dr. Tony Calanducci, INFN Catania
Space Physics Interactive Data Resources SPIDR
SPIDR is a de facto standard data source on solar-terrestrial physics, functioning within the framework of the ICSU World Data Centers.
It is a distributed database and application server network, built to select, visualize and model historical space weather data distributed across the Internet.
SPIDR can work as a fully-functional web-application (portal) or as a grid of web-services, providing functions for other applications to access its data holdings.
SPIDR components
SPIDR portal combines the central XML metadata repository with a set of distributed data web services and data file collections. A user can search for data using metadata inventory, use persistent data basket to save the selection for the next session, and plot or download in parallel the selected data in different formats, including XML and NetCDF.
[Figure: SPIDR components diagram. Labels: Virtual Community of Registered Users; Virtual Observatory Metadata; Virtual Data Sources; Authenticate; Find event; Get data; User results; queries; Web Portal: Workflow, Data Ingest, Mining, Visualization and Delivery.]
[Figure: real-time usage statistics for a given time interval. Panels: user sessions per day; per-database requests for plot (red) and export (blue). Total ~20 000 registered users.]
gLibrary usage scenarios
• Example 1:
– Locate all theoretical (PPTType) PowerPoint (Type) presentations about FireMan (Keywords) given in 2005 (Date) by Uncle Sam (Speaker);
– Find all the movies (Type) in which Julia Roberts (Cast) performed together with Hugh Grant (Cast) produced in USA (Country) in 2004 (ReleaseDate); or all the acoustic (Genre) mp3 (Format) audio files (Type) of Alanis Morissette (Singer) that last more than 3 minutes (Runtime).
• Example 2:
– A doctor is looking for brain (keyword) DICOM (Type) images of male (Gender) patients older than 65 (Age).
• Example 3:
– A job can behave as a storage crawler: it scans pre-existing files in Storage Elements to extract relevant metadata that will be published on gLibrary for further data mining.
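Example 2 above amounts to filtering entries on several attributes at once. The sketch below shows that idea in plain Python over invented sample entries; it is not the AMGA or gLibrary query syntax, and the `find` helper, the entry names and the attribute spellings are assumptions.

```python
# Invented sample entries; attribute names loosely follow Example 2.
entries = [
    {"name": "scan-001", "Type": "DICOM", "Keyword": "brain", "Gender": "male", "Age": 71},
    {"name": "scan-002", "Type": "DICOM", "Keyword": "brain", "Gender": "female", "Age": 80},
    {"name": "scan-003", "Type": "DICOM", "Keyword": "knee", "Gender": "male", "Age": 69},
    {"name": "scan-004", "Type": "DICOM", "Keyword": "brain", "Gender": "male", "Age": 42},
]

def find(items, **conditions):
    """Return the entries for which every named attribute satisfies its condition."""
    return [e for e in items
            if all(attr in e and test(e[attr]) for attr, test in conditions.items())]

hits = find(
    entries,
    Type=lambda v: v == "DICOM",
    Keyword=lambda v: v == "brain",
    Gender=lambda v: v == "male",
    Age=lambda v: v > 65,
)
names = [e["name"] for e in hits]
print(names)
```

In a real deployment the filtering would be done by the metadata catalog on the server side, so only the matching entries travel back to the client.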
Example of gLibrary collections
Collection /gLTypes (entry name → Path attribute, which refers to a collection)
• PowerPoint → /gLPPT
• EGEEDOC → /EGEEPPT
• Documents → /gLDOC
• Video → /gLVideo
• Image → /gLImage
• Audio → /gLAudio
Collection /EGEEPPT, entry 00454dca-a269-4b93-8a45-c4012af05600
• Title: Information Systems
• Date: 2005-10-23
• Event: 4th EGEE Conference
• Speaker: Giuseppe La Rocca, Valeria Ardizzone
• Topic: R-GMA, BDII
• Author: Valeria Ardizzone, Giuseppe La Rocca
• Runtime: 00:30:00
• Type: Theorical
Collection /gLAudio, entry 4ffaffc8-26e7-4826-b460-3d5bf08081a4
• SongTitle: Dedicato A Te
• Singer: Le Vibrazioni
• Format: MP3
• Album: Dedicato A Te
• Duration: 00:03:27
• Genre: Pop
Collection /gLKeys (“additional features”), entry 00454dca-a269-4b93-8a45-c4012af05600
• Passphrase: ardizzo
gLibrary Security
• User Requirements:
– A valid proxy with VOMS extensions
– VOMS Role and Group needed to be recognized by gLibrary as a contents manager
• 3 kinds of users:
– gLibraryManager: (s)he can create new content types and allow a generic VO user to become gLibrarySubmitter
– gLibrarySubmitters: they can add new entries and define access rights on the entries they create. Fine-grained permission (reading, writing, listing, decrypting) settings on each entry: whole VO members, VO groups, list of DNs
– Generic VO users: browse and make queries (on entries they have access to)
• Basic level of cryptography:
– New files saved on SEs can be encrypted beforehand with a symmetric passphrase that will be saved in /gLKeys. Only selected users (that have a specific DN in the subject of their VOMS proxy) can access the passphrase and decrypt the file.
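The per-entry permission settings (reading, writing, listing, decrypting, each granted to a whole VO, to VO groups or to a list of DNs) can be pictured with a small access check. The `User` and `EntryAcl` classes, the grantee encoding and all names below are hypothetical; they only illustrate the rule and are not gLibrary or AMGA code.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    dn: str                          # certificate subject (illustrative value)
    vo: str                          # virtual organisation
    groups: frozenset = frozenset()  # VO groups the user belongs to

@dataclass
class EntryAcl:
    # permission -> set of grantees; a grantee is ("vo", name), ("group", name) or ("dn", dn)
    grants: dict = field(default_factory=dict)

    def allows(self, user, permission):
        """True if the user's VO, one of the user's groups, or the user's DN holds the permission."""
        grantees = self.grants.get(permission, set())
        return (("vo", user.vo) in grantees
                or any(("group", g) in grantees for g in user.groups)
                or ("dn", user.dn) in grantees)

acl = EntryAcl(grants={
    "reading": {("vo", "vo.example.org")},
    "writing": {("group", "/vo.example.org/managers")},
    "decrypting": {("dn", "/C=XX/O=Example/CN=Alice")},
})
alice = User(dn="/C=XX/O=Example/CN=Alice", vo="vo.example.org")
bob = User(dn="/C=XX/O=Example/CN=Bob", vo="vo.example.org",
           groups=frozenset({"/vo.example.org/managers"}))
print(acl.allows(alice, "decrypting"), acl.allows(bob, "decrypting"))
```

Here every VO member can read the entry, but only the user whose DN is listed can obtain the decrypting permission, which mirrors the passphrase rule described above.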
Features
• Born as a use case to demonstrate AMGA features
• Built on top of many gLite services
• Considering collaboration and integration with NA3 Document Digital Library System
• Fast → thanks to AMGA
• Secure → ACLs, encryption, and splitting
• Easy to use → user-friendly Java GUI, and portal soon available
• Easily extensible to support any document types (Medical Images and files, Invoices, Proceedings, Scientific Publications, Newspaper clips, …)
Discussion
• SPIDR wants to use grids for
– Security and access control
– Asynchronous access to large amounts of data
• gLibrary
– Flexibility of the schema to adapt to many document types
– Content analysis / indexing of documents
• Very different needs for database access => room for many solutions:
– GDSE: Time consuming jobs on databases
– AMGA: Fast access to small amounts of (returned) metadata
– SPIDR: Asynchronous access to large amounts of metadata