INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org Summary of the data access session EGEE User Forum, March 3rd, 2006 Johan Montagnat, Birger Koblitz


INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

Summary of the data access session

EGEE User Forum, March 3rd, 2006

Johan Montagnat, Birger Koblitz

Data access parallel session

• ~60 persons attending (peak activity)
• Talks were grouped in 3 different panels
  – Metadata and databases access
  – File access
  – Applications
• 3 associated demonstrations

Agenda

• Panel on metadata and databases access
• GDSE: data source oriented computing element
  – Dr. Giuliano Taffoni, INFN, CNAF
• ATLAS metadata interface
  – Thomas Doherty, University of Glasgow
• The AMGA metadata service
  – Dr. Birger Koblitz, CERN
• Oracle on the grid
  – Bjorn Engsing, Oracle

DSE: Data Source Engine

• We define a new Grid component (G-DSE) that enables access to a Data Source Engine and Data Source, fully integrated with the Grid Monitoring and Discovery System and the Resource Broker.
• A new Grid Element, the Query Element, can then be built on top of the G-DSE component.
• It handles very long SQL queries just like a CE would handle jobs.
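
The idea of handling SQL queries the way a CE handles jobs can be illustrated with a minimal sketch. It is not the G-DSE implementation: the `QueryJob` and `QueryElement` classes, the state names and the use of SQLite are all assumptions made for illustration. Each query is submitted, receives an identifier, and moves through job-like states until its result can be fetched.

```python
import sqlite3
import uuid
from dataclasses import dataclass, field
from queue import Queue
from typing import Optional


@dataclass
class QueryJob:
    """A query treated like a batch job (hypothetical structure)."""
    sql: str
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "SUBMITTED"          # hypothetical state names
    rows: list = field(default_factory=list)
    error: Optional[str] = None


class QueryElement:
    """Queues queries and runs them later, as a CE would queue jobs."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection
        self._queue: Queue = Queue()
        self._jobs: dict = {}

    def submit(self, sql: str) -> str:
        job = QueryJob(sql)
        self._jobs[job.job_id] = job
        self._queue.put(job)
        return job.job_id

    def run_pending(self) -> None:
        # Drain the queue; a real service would do this asynchronously.
        while not self._queue.empty():
            job = self._queue.get()
            job.state = "RUNNING"
            try:
                job.rows = self._conn.execute(job.sql).fetchall()
                job.state = "DONE"
            except sqlite3.Error as exc:
                job.error = str(exc)
                job.state = "FAILED"

    def status(self, job_id: str) -> str:
        return self._jobs[job_id].state

    def result(self, job_id: str) -> list:
        return self._jobs[job_id].rows
```

A client would call `submit()`, poll `status()` and fetch `result()` once the state is `DONE`, which mirrors the submit/monitor/retrieve cycle of a compute job.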

GDSE integration

[Figure: G-DSE integration architecture. Diagram labels: gatekeeper; JobManager and QueryManager; JobProcess and QueryProcess; scheduler plug-in (PBS/LSF); query plug-in; query DB-specific driver; GRAM; GIS; RDBMS; MDS; GRIS; LDAP LDIF; Grid Providers (snmp).]

G-DSE features

• Data source indexing, monitoring, management and recovery
• GRAM or WS protocol
• Transactions/queries specified through RSL/JDL
• The grid WMS is used to support the execution
• The grid IS is used to monitor the transactions
• GSI and VOMS based access control
  – Different roles (administrator, writer, selecter)
  – Access control at table and row level
• Connects to different RDBMS
• Supports workflows of query jobs with inter-dependencies
• Support for replication
• Application to AstroDBs

ATLAS Metadata Interface

• It is an application under development that stores and provides access to dataset metadata for the ATLAS experiment
• It fulfils the needs of many database-backed applications by offering a generic web service and servlet interface, through the use of self-describing databases
• It supports geographical distribution through web services, and secure access through grid certificates

Adaptation of AMI Architecture for gLite Interfaces

[Figure: call flow diagram. Labels: Web Service client; gLite interface method; gLite interface implementation; controller class; result returned in XML format.]
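
The last step of this flow, a result returned in XML format, can be sketched as follows. This is not AMI's actual schema: the `result`, `row` and `field` element names and the `rows_to_xml` helper are hypothetical, chosen only to show how rows from a database query might be serialized for a web service client.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(columns, rows):
    """Serialize query rows as XML (element names are illustrative only)."""
    root = ET.Element("result")
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            field_el = ET.SubElement(row_el, "field", name=name)
            field_el.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up column names and values, for demonstration only.
    print(rows_to_xml(["dataset", "events"], [("example-dataset", 1000)]))
```

Keeping the column name as an attribute means the client does not need to know the schema in advance, which fits the "self-describing databases" approach mentioned for AMI.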

AMI features

• Supports Oracle, MySQL and SQLite DBs
• gLite metadata interface
• Web Service interface (AXIS container in Tomcat)
• Authentication: based on certificate DN
• Very fine-grained authorization
  – Roles
  – At project or record level
  – May write ad hoc control classes
• Secure and well-defined interface for providing access to metadata

AMGA: ARDA metadata interface

• gLite 1.5 metadata catalog
• Two modes
  – With the LFC: bind metadata to files
  – Standalone: general relational data
• Front ends
  – Web Service
  – Proprietary TCP streaming protocol
• Implementing the gLite metadata interface
• Versatile, provides both performance and security
• Security components (optional)
  – SSL connections
  – Authentication based on passwords, X509 certificates or proxies
  – POSIX ACLs and Unix permissions at table and row level
• Applications: LHCb, Medical Data Management, gLibrary, UnoSat...

Performance

• Comparison with LFC and FireMan catalogs

Replication & Federation modes

Oracle

• Free Oracle software
  – Express edition, limited to 1 CPU
• Support for Linux on many distributions
• Provides streams for replication

Discussion

• Standards
  – How does gLite commit to standards?
  – A lot of GGF work has been invested in defining standards
  – Difficult to endorse standards as they are evolving and the global picture is not so clear today
• Security
  – Common concern, different granularities
• Replication
  – Partially implemented in existing databases, with different semantics
  – Should this be implemented at a higher level?
• Distribution
  – Some work on information schemas for locating metadata
  – What about queries on data whose location is not known a priori?
• Grids of databases
  – Let the grid pick the “best” database for you
• There is room for more research activity!

Agenda

• Panel on file access
• gLite File Transfer Service
  – Paolo Badino, CERN
• Encrypted Data Storage in EGEE
  – Akos Frohner, CERN
• Storage Resource Manager Interface
  – Maarten Litmaath, CERN

File Transfer Service channels

• Logical unit of management
  – Represents a directed network pipe between two sites
• Mono-directional
• Independently manageable
  – State
  – Number of streams
  – Number of concurrent transfers
• Inter-VO scheduling
  – VO share
• No routing
• Between specific host pairs or groups of hosts
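
The channel properties listed above can be pictured as a small data model. The sketch below is an illustration only, not the FTS implementation: the `Channel` class, the `"Active"` state name, the site and VO names, and the share-based selection rule are all assumptions. It shows one plausible way a per-channel scheduler could use VO shares to pick whose transfer runs next.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Channel:
    """A one-way pipe from `source` to `destination` (hypothetical model)."""
    source: str
    destination: str
    state: str = "Active"             # hypothetical state name
    streams: int = 1                  # streams per transfer
    max_concurrent: int = 1           # concurrent transfers allowed
    vo_shares: dict = field(default_factory=dict)   # VO -> relative share
    active: dict = field(default_factory=dict)      # VO -> running transfers

    def next_vo(self, waiting) -> Optional[str]:
        """Pick the waiting VO that is furthest below its share."""
        if self.state != "Active":
            return None
        if sum(self.active.values()) >= self.max_concurrent:
            return None
        candidates = [vo for vo in waiting if self.vo_shares.get(vo, 0) > 0]
        if not candidates:
            return None
        return min(
            candidates,
            key=lambda vo: self.active.get(vo, 0) / self.vo_shares[vo],
        )


if __name__ == "__main__":
    ch = Channel("SITE-A", "SITE-B", max_concurrent=4,
                 vo_shares={"vo1": 3, "vo2": 1},
                 active={"vo1": 2, "vo2": 1})
    print(ch.next_vo(["vo1", "vo2"]))
```

Because a channel is mono-directional, transfers from SITE-B back to SITE-A would need a second `Channel` object with its own state and limits.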

Transfer Jobs and Files

• Job
  – Represents the transfer request
  – Identified by a GUID
• File
  – A pair of source and destination file names

[Figures: job state diagram and file state diagram]

What SC achieved so far

• SC3 Rerun (January 2006)
• All sites achieved target rate
• 8/11 sites achieved nominal rate

Encryption/Decryption System

• Designed to fulfill biomedical application needs
  – Fine-grained access control
  – Data encryption
  – Anonymity
• Based on gLiteIO, FiReMan and an SRM v1.1
• Access control through gLiteIO

Encryption

• Anonymity: patient data separated from files (stored in AMGA)
• ACL access control on files (FiReMan)
• File keys distributed among Hydra servers with ACL
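
The slides do not say how file keys are divided among the Hydra servers. As an illustration of the general idea only, the sketch below assumes the simplest possible scheme: an XOR split in which every share is required to rebuild the key. The function names are hypothetical. A threshold scheme such as Shamir's secret sharing would additionally tolerate unavailable servers, which this sketch does not.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(key: bytes, n_servers: int) -> list:
    """Split `key` into n shares; all n are needed to recover it."""
    if n_servers < 2:
        raise ValueError("need at least two servers to split a key")
    # n-1 random shares, plus one share that XORs with them back to the key.
    shares = [secrets.token_bytes(len(key)) for _ in range(n_servers - 1)]
    last = reduce(xor_bytes, shares, key)
    return shares + [last]


def join_key(shares: list) -> bytes:
    """Recombine all shares into the original key."""
    return reduce(xor_bytes, shares)
```

With this split, a single compromised key server learns nothing about the file key, since each individual share is indistinguishable from random bytes.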

And decryption

• Key retrieved from the Hydra key server
• Data decrypted block by block in memory (OpenSSL ciphers)
• Encryption also works for output data

What is the SRM?

• Client-server interface for Storage Resource Management
  – De facto standard (see further on), GGF working group http://sdm.lbl.gov/srm-wg/
  – Secure web service
  – Defines functions that allow storage resources to be managed from both client and server perspectives
    Different requirements, optimizations, concerns
• SRM collaboration institutes develop different implementations
  – CERN + RAL + INFN (CASTOR-2)
  – CERN/LCG (DPM)
  – FNAL + DESY (dCache)
  – JLAB (J-SRM)
  – LBNL (DRM, HRM)
  – EGRID/INFN/GridIt (StoRM)

Is the SRM a standard?

• “The nice thing about standards is that there are so many to choose from.” - Andrew S. Tanenbaum
• Version 1.1 in widespread use
  – But implementations have subtle incompatibilities due to ambiguities in the “standard”
  – Various basic functionalities not defined
• Version 2.1 implemented to various extents by some projects
  – Try to get a critical subset implemented on WLCG by autumn 2006
    Use cases defined by LHC experiments, see next pages
  – Still lacks some features
  – Incompatible with version 1
    Clients and servers need to support both versions during the transition period (may last a long time)
• Version 3 definition many months away
  – Again incompatible

What should the SRM do? (A. Shoshani, PPDG Review, 28 Apr 2003)

• Manage space dynamically
  – Any disk caches and Mass Storage Systems
  – Space reservation and negotiation
  – Manage “lifetime” of spaces
• Manage files dynamically
  – Pin files in storage till they are released
  – Manage “lifetime” of files, and action when lifetime expires
• Manage file sharing
  – Policies on what to evict when space is needed
    Currently always decided by back-end
• Manage multi-file requests
  – A brokering function: queue file requests, pre-stage files
  – Invoke file transfer services
• Permit site-SRM over multiple storage systems
• Negotiate transfer protocols
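
The pinning and lifetime items above can be made concrete with a small sketch. It does not follow any SRM specification: the `PinTable` class and its method names are hypothetical. A pinned file is protected from eviction until it is released or its lifetime expires, whichever comes first.

```python
import time


class PinTable:
    """Tracks pinned files and their expiry times (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock           # injectable clock, handy for testing
        self._expiry = {}             # path -> time at which the pin lapses

    def pin(self, path: str, lifetime_s: float) -> None:
        """Pin `path` for `lifetime_s` seconds from now."""
        self._expiry[path] = self._clock() + lifetime_s

    def release(self, path: str) -> None:
        """Explicitly release a pin before its lifetime expires."""
        self._expiry.pop(path, None)

    def is_pinned(self, path: str) -> bool:
        expires_at = self._expiry.get(path)
        return expires_at is not None and expires_at > self._clock()

    def evictable(self, paths) -> list:
        """Files the back-end may evict when space is needed."""
        return [p for p in paths if not self.is_pinned(p)]
```

An eviction policy would then choose among the `evictable()` files only, which is the part the slide notes is currently always decided by the back-end.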

Discussion

• Connection between Data Management and job scheduling
  – The file catalog holds information on file locations, used for scheduling
  – Jobs are scheduled where data sits
  – In some cases, data could move to where resources are available for computations.
  – Is this desirable?
• Legacy code is common in scientific applications
  – Transparent POSIX access
• Data encryption
  – Transparency
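
The scheduling question raised in this discussion (send the job to the data, or move the data to free resources) can be sketched as a simple decision rule. The `choose_site` function and its inputs are hypothetical and do not describe the behaviour of any gLite component: the rule prefers a site that already holds a replica, and falls back to moving the data only when no such site has a free slot.

```python
def choose_site(replica_sites, free_slots):
    """Return (site, data_must_move) for one job, or None if nothing is free.

    replica_sites: sites that hold a copy of the input file, as a file
                   catalog would report them.
    free_slots:    mapping of site name -> number of free job slots.
    """
    # First choice: run where the data already sits.
    with_data = [s for s in replica_sites if free_slots.get(s, 0) > 0]
    if with_data:
        return max(with_data, key=lambda s: free_slots[s]), False

    # Fallback: run on any idle site and transfer the data there.
    idle = [s for s, n in free_slots.items() if n > 0]
    if idle:
        return max(idle, key=lambda s: free_slots[s]), True

    return None
```

Whether the fallback branch is desirable depends on file size and network cost, which is exactly the open question noted in the discussion.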

Agenda

• Panel on applications
• Space Physics Interactive Data Resource (SPIDR)
  – Dr. Zhizhin, Russian Academy of Sciences
• gLibrary: a multimedia contents manager system
  – Dr. Tony Calanducci, INFN Catania

Space Physics Interactive Data Resource (SPIDR)

SPIDR is a de facto standard data source on solar-terrestrial physics, functioning within the framework of the ICSU World Data Centers.

It is a distributed database and application server network, built to select, visualize and model historical space weather data distributed across the Internet.

SPIDR can work as a fully-functional web-application (portal) or as a grid of web-services, providing functions for other applications to access its data holdings.

SPIDR components

The SPIDR portal combines the central XML metadata repository with a set of distributed data web services and data file collections. A user can search for data using the metadata inventory, use a persistent data basket to save the selection for the next session, and plot or download the selected data in parallel in different formats, including XML and NetCDF.

[Figure: SPIDR components diagram. Labels: Virtual Community of Registered Users; Virtual Observatory Metadata; Virtual Data Sources; Web Portal (Workflow, Data Ingest, Mining, Visualization and Delivery); interactions: authenticate, find event, get data, user results, queries.]

Real-time usage statistics for a given time interval

• Total: ~20 000 registered users

[Figure: usage plots. Panels: user sessions per day; per-database requests for plot (red) and export (blue).]

gLibrary usage scenarios

• Example 1:
  – Locate all theoretical (PPTType) PowerPoint (Type) presentations about FireMan (Keywords) given in 2005 (Date) by Uncle Sam (Speaker);
  – Find all the movies (Type) in which Julia Roberts (Cast) performed together with Hugh Grant (Cast) produced in USA (Country) in 2004 (ReleaseDate); or all the acoustic (Genre) mp3 (Format) audio files (Type) of Alanis Morissette (Singer) that last more than 3 minutes (Runtime).
• Example 2:
  – A doctor is looking for brain (Keyword) DICOM (Type) images of male (Gender) patients older than 65 (Age).
• Example 3:
  – A job can behave as a storage crawler: it scans pre-existing files in Storage Elements to extract relevant metadata that will be published on gLibrary for further data mining.
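
Example 2 above amounts to an attribute-based selection over metadata entries. The sketch below shows that selection in plain Python rather than through AMGA or gLibrary, whose query interfaces are not described in these slides. The `query` and `matches` helpers and the three sample entries are made up for illustration; attribute names follow the slide.

```python
def matches(entry: dict, conditions: dict) -> bool:
    """True if `entry` satisfies every condition.

    A condition is either a plain value (equality test) or a callable
    predicate applied to the attribute value.
    """
    for attr, cond in conditions.items():
        if attr not in entry:
            return False
        value = entry[attr]
        if callable(cond):
            if not cond(value):
                return False
        elif value != cond:
            return False
    return True


def query(entries, **conditions):
    """Return the entries whose attributes satisfy all conditions."""
    return [e for e in entries if matches(e, conditions)]


# Made-up sample metadata, one dict per catalog entry.
images = [
    {"name": "img-001", "Keyword": "brain", "Type": "DICOM", "Gender": "male", "Age": 71},
    {"name": "img-002", "Keyword": "brain", "Type": "DICOM", "Gender": "female", "Age": 68},
    {"name": "img-003", "Keyword": "brain", "Type": "DICOM", "Gender": "male", "Age": 40},
]

hits = query(images, Keyword="brain", Type="DICOM", Gender="male",
             Age=lambda age: age > 65)
print([e["name"] for e in hits])   # prints ['img-001']
```

In a real deployment the filtering would run inside the metadata catalog, so that only the matching entries travel over the network.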

Example of gLibrary collections

Collection /gLTypes
• Attribute: Path (refers to a collection)
• Entries (entry name: Path)
  – Audio: /gLAudio
  – Image: /gLImage
  – Video: /gLVideo
  – Documents: /gLDOC
  – EGEEDOC: /EGEEPPT
  – PowerPoint: /gLPPT

Collection /EGEEPPT
• Entry name: 00454dca-a269-4b93-8a45-c4012af05600
  – Title: Information Systems
  – Date: 2005-10-23
  – Event: 4th EGEE Conference
  – Speaker: Giuseppe La Rocca, Valeria Ardizzone
  – Topic: R-GMA, BDII
  – Author: Valeria Ardizzone, Giuseppe La Rocca
  – Runtime: 00:30:00
  – Type: Theoretical

Collection /gLAudio
• Entry name: 4ffaffc8-26e7-4826-b460-3d5bf08081a4
  – SongTitle: Dedicato A Te
  – Singer: Le Vibrazioni
  – Format: MP3
  – Album: Dedicato A Te
  – Duration: 00:03:27
  – Genre: Pop

Collection /gLKeys (“additional features”)
• Entry name: 00454dca-a269-4b93-8a45-c4012af05600
  – Passphrase: ardizzo

gLibrary Security

• User requirements:
  – A valid proxy with VOMS extensions
  – A VOMS Role and Group are needed to be recognized by gLibrary as a contents manager.
• 3 kinds of users:
  – gLibraryManager: (s)he can create new content types and allow a generic VO user to become a gLibrarySubmitter
  – gLibrarySubmitters: they can add new entries and define access rights on the entries they create.
    Fine-grained permission settings (reading, writing, listing, decrypting) on each entry: all VO members, VO groups, list of DNs
  – Generic VO users: browse and make queries (on entries they have access to)
• Basic level of cryptography:
  – New files saved on SEs can be encrypted beforehand with a symmetric passphrase that will be saved in /gLKeys. Only selected users (that have a specific DN in the subject of their VOMS proxy) can access the passphrase and decrypt the file.
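
The per-entry permissions described above (reading, writing, listing, decrypting, granted to all VO members, to VO groups or to a list of DNs) can be sketched as a simple access check. This is an illustration and not gLibrary's code: the `User` and `EntryACL` classes and the `"vo"`, `"group:..."` and `"dn:..."` principal notation are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    """Identity as it might be read from a VOMS proxy (hypothetical)."""
    dn: str
    vo: str
    groups: frozenset = frozenset()


@dataclass
class EntryACL:
    """Per-entry access rights.

    `grants` maps a permission name to a set of principals, written as
    "vo" (every member of the entry's VO), "group:<name>" or "dn:<DN>".
    """
    vo: str
    grants: dict = field(default_factory=dict)

    def allows(self, user: User, permission: str) -> bool:
        principals = self.grants.get(permission, set())
        if "vo" in principals and user.vo == self.vo:
            return True
        if any(f"group:{g}" in principals for g in user.groups):
            return True
        return f"dn:{user.dn}" in principals
```

A submitter could, for instance, grant `read` to the whole VO while restricting `decrypt` to a short list of DNs, which matches the passphrase rule stated for /gLKeys.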

gLibrary features

• Born as a use case to demonstrate AMGA features
• Built on top of many gLite services
• Considering collaboration and integration with NA3 Document Digital Library System
• Fast → thanks to AMGA
• Secure → ACLs, encryption, and splitting
• Easy to use → user-friendly Java GUI, and portal soon available
• Easily extensible to support any document type (medical images and files, invoices, proceedings, scientific publications, newspaper clips, …)

Discussion

• SPIDR wants to use grids for
  – Security and access control
  – Asynchronous access to large amounts of data
• gLibrary
  – Flexibility of the schema to adapt to many document types
  – Content analysis / indexing of documents
• Very different needs for database access => room for many solutions:
  – GDSE: Time consuming jobs on databases
  – AMGA: Fast access to small amounts of (returned) metadata
  – SPIDR: Asynchronous access to large amounts of metadata