Upload
ovidiu-vasilca
View
220
Download
0
Embed Size (px)
Citation preview
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
1/46
The EDGI project receives Community research
funding under grant agreement 2615561
Software Engineering for Distributed
Data Processing Infrastructures (DCIs)
Etienne URBAH [email protected] LODYGENSKY [email protected]
LAL, Univ Paris-Sud, IN2P3/CNRS, Orsay, France
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
2/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 2
1) Distributed Data Processing
2) Best practices Software engineering
3) DCI = Federation of Administrative Domains
4) Middleware Stacks for Distributed Data
Processing
5) Standards for Distributed Data Processing
Summary
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
3/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 3
1) Distributed Data Processing
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
4/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 4
1.1) Scope of Data Processing
Data processing is much more than computing :
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
5/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 5
1.1) Scope of Data Processing
Data processing is much more than computing :
Data acquisition
Instrument management
Data, Metadata management
Access rights to data
Computing
Filtering, Sorting Graphical display
Long term curation, ...
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
6/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 6
1.2) Data Processing on a User Device
On any single user device (PC, Mobile phone, ...) :
Access rights are very simple,
But storage and computing power are limited.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
7/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 7
1.3) Data Processing needs to be distributed
Depending on the workflow of data processing, infrastructure types are more or lessappropriate, with a tradeoff between development effort, performance and cost :
Costly supercomputersare mandatory for workflows requiring huge corememory and/or highly coupled parallelism.
Development effort is high for efficient highly coupled parallelism.
DCIssuch as EGI and OSG are optimal for secure data sharing, and forworkflows using large data sets with medium core memory and smallparallelism. Development effort is low for single applications.
Desktop gridsare optimal for workflows requiring heavy uncoupled
parallelism, moderate core memory and moderate data sets (< 100 MB).In the near future, desktop grids will also permit secure storage of datasets of any size, but transfer rate will stay limited.Development effort for single applications highly depends on the grid
middleware (high for BOINC, low for XtremWeb).
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
8/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 8
2) Best practices Software engineering
WHO ?
WHAT ?
WHEN ?
WHERE ?
HOW ?
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
9/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 9
2.1) Bad practices to avoid
In order to harness the complexityof distributed data processing:
Beware of buzzwords Avoid meaninglesspretty pictures
Passive sentences should also be avoided :Each of us is responsible for avoiding passive sentences.
Beware
of Clouds
Beware of
Virtualization
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
10/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 10
2.2) Best practices Software engineering
In order to harness the complexityofdistributed data processing:
Best practicesand Software engineeringalready exist :
UML http://www.uml.org/
CMMI-DEV http://www.sei.cmu.edu/library/abstracts/reports/10tr033.cfm
ITIL http://www.itil-officialsite.com/AboutITIL/WhatisITIL.aspx
etc
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
11/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 11
2.3) Best practices Simple questions
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
12/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 12
2.4) Software engineering UML
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
13/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 13
2.5) Software engineering First steps
Using Best practices andSoftware engineeringtoharness the complexity ofdistributed data processing:
Clearly distinguishanddescribe relationshipsbetween concepts
An extensive glossaryis availableat http://www.ogf.org/documents/GFD.181.pdf
A use case collectionis availableat http://www.ogf.org/documents/GFD.180.pdf
Robustness is an intrinsic property of a resource.
Virtualizationpermits flexible on demandmanagement of underlying resources, but only if theseunderlying resources are already robust, and at the costof the virtualization layer, which may take months toacquire robustness.
Administrative domainscontain a repositorypermitting easy authentication and authorization of users.
Grid computingis a federation of administrativedomains, based on mutual trust, some common
operational setup (in particular the trust anchor forauthentication and authorization), specialized interopera-ble middleware stacks, and permit secure data sharing.
Cloud Computingis the on demand management(allocation, instantiation and usage) of internal orexternal computing resources inside a given administrati-
ve domain; but it does currently not span across multipleadministrative domains.
Interoperability between clouds does not existyet, as it requires some common cloud operational setupand interoperable cloud middleware stacks.Work is going on federation of clouds, for instance inhttp://contrail-project.eu
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
14/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 14
2.6) Software engineering Users
End Usersof DCIs arescientists, with various ICT,Grid and Cloud knowledge.For example :
Application developers,
Experienced application users,
Scientistswith noICTknowledge using a scientificportal,
...
Direct Usersof DCIs are various : Developersof scientific applications,
Integratorsof scientific applicationsfor grids,
Providers of scientific workflowengines,
Providers of scientific portals,
Providers of SAGA,
Site Administrators,
VO Administrators,
In order to assess User Needs, take care to track all Users:
Users having ICT knowledgeare able to express, criticize and validate soundneeds and requirements.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
15/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 15
3) DCI = Federation ofAdministrative Domains
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
16/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 16
3.1) Administrative Domain Principle
Each organization (University, ResearchInstitute, Administration, Enterprise, ...)has in general an Administrative Domain
managed by an operational team andprotected by a firewall.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
17/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 17
3.1) Administrative Domain Principle
Each organization (University, ResearchInstitute, Administration, Enterprise, ...)has in general an Administrative Domain
managed by an operational team andprotected by a firewall.
This Administrative Domain permits thateach registered user can use any
shared resourceof the AdministrativeDomain following defined access rights.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
18/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 18
3.2) Payload and Support Functionalities
The end user isprimarily interestedin 'payload'functionalities :
Data Acquisition
Instrument
Management
Data, Metadata
Management
Computing
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
19/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 19
3.2) Payload and Support Functionalities
The end user isprimarily interestedin 'payload'functionalities :
Data Acquisition
Instrument
Management
Data, Metadata
Management
Computing
But management of the sharedresources of an Administrative Domainalso requires 'support' functionalities :
Security
Information (Publication of resourcecapabilities so that they can bedynamically discovered)
Monitoring
Logging and Bookkeeping
Accounting
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
20/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 20
3.2) Payload and Support Functionalities
The end user isprimarily interestedin 'payload'functionalities :
Data Acquisition
Instrument
Management
Data, Metadata
Management
Computing
But management of the sharedresources of an Administrative Domainalso requires 'support' functionalities :
Security
Information (Publication of resourcecapabilities so that they can bedynamically discovered)
Monitoring
Logging and Bookkeeping
Accounting
These functionalities are provided more or less completely andconsistently by access and sharing middleware.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
21/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 21
3.3) Administrative Domain :
Access and sharing methods
Inside anAdministrativeDomain, anyaccess or sharingmiddleware maybe used :
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
22/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 22
3.3) Administrative Domain :
Access and sharing methods
BitDew
CORBA
File server
Database server
Remote login (SSH, RDP)
Batch system
Applicative portal
Private grid (Condor,BOINC, XtremWeb, ...)
Private cloud
...
Inside anAdministrativeDomain, anyaccess or sharingmiddleware maybe used :
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
23/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 23
3.3) Administrative Domain :
Access and sharing methods
Inside anAdministrativeDomain, anyaccess or sharing
middleware maybe used :
Standardization isoptional.
BitDew
CORBA
File server
Database server
Remote login (SSH, RDP)
Batch system
Applicative portal
Private grid (Condor,BOINC, XtremWeb, ...)
Private cloud
...
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
24/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 24
Commercial
Cloud
3.4) Clouds
Private
Cloud
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
25/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 25
Commercial
Cloud
3.4) Clouds provide on-demand resources
A cloud provide on-demand resources with a richand consistent set of data processingfunctionalities, and a powerful and consistentinterface, but limited to ONE cloud provider (nointeroperability).
It permits flexible extension of an AdministrativeDomain.
Cost of a commercial cloud is affordable forpermanent data storage and computing, but veryhigh for the transfer of large amount of data.
Currently, clouds do NOT manage federation ofAdministrative Domains.Work is going on federation of clouds, for instancein http://contrail-project.eu
Private
Cloud
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
26/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 26
3.5) DCI = Federation of Administrative Domains
The goal is sharing of data.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
27/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 27
3.5) DCI = Federation of Administrative Domains
The goal is sharing of data.The principle is mutual trust.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
28/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 28
3.5) DCI = Federation of Administrative Domains
The method is agreement oncommon middleware stacksand operational procedures.
Resources stay locally managed.
The goal is sharing of data.The principle is mutual trust.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
29/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 29
3.5) DCI = Federation of Administrative Domains
The goal is sharing of data.The principle is mutual trust.
The method is agreement oncommon middleware stacksand operational procedures.
Resources stay locally managed.
The middleware providesconsistent access rightsto all users according to
their DCI credentials.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
30/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 30
3.5) DCI = Federation of Administrative Domains
Some scientific, academic and research organizations, which alreadyown data related resources, increasingly need to securely sharethese resources with others.
In order to federate their private resources into a productioninfrastructure, these organizations have had to establish mutual
trust, adopted compatible middleware stacks and proceduresintegrated through operations teams to bring their resourcestogether into a Distributed Computing Infrastructure (DCI).
The recurring feature of the various DCIs that offer productionresources (e.g. EGI, DEISA, PRACE, etc.) is that each one integratesmultiple locally managed administrative domains into a usable
environment. The middleware deployed by each DCI provides its users, according
to their DCI credentials, consistent access rights to all resourcesmanaged by that DCI.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
31/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 31
4) Middleware Stacks forDistributed Data Processing
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
32/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 32
4.1) Various DCI middleware stacks
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
33/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 33
4.2) DCI middleware stacks Poor interoperability
This diagram was createdby Morris RIEDEL (Jlich
Supercomputing Centre)
Each area in
white is a gridinfrastructure,
now called DCI.
Interoperability,
as shown ingreen and red,
is desired.
But it is not
achieved yet.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
34/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 34
5) Standards for Distributed Data Processing
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
35/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 35
5.1) Identity of a User on a DCI
Each DCI spreads across several Administrative Domains.
So (login, password) does NOT work anymore.
Each user must be identified by a certificate.
Each middleware component using (login, password) must be
adapted. In the USA, the InCommon federation provides SSL
certificates.
X509 certificates are the most robust : Such certificates are provided to a user, after approval of his/her
organization, by a Certificate Authority.
Each Administrative Domain must maintain a copy of the list ofCertificate Authorities.
The list of CAs published by IGTF (International Grid Trust Federation)is a de facto standard.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
36/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 36
5.2) Security Layer of DCI Middleware Stacks
If you are lucky, all software components using(login, password) are maintained in a consistentsecurity layer.
This security layer must now handle X509certificates.
This permit any user, depending on his/her access
rights, to use shared resources or share his/her ownones. In particular, he/she can submit jobs (tasks,activities) on remote resources.
But a job executed on a remote resource often has
to process data in another location, with the accessrights of the job submitter.
This is a complicated problem called delegation,
which the security layer also has to solve. Each instance of this security layer must work
consistently in each of the various AdministrativeDomains of one given DCI, and this requires verycareful standardization.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
37/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 37
5.3) GLUE 2.0 (OGF) permits to describe DCI entities
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
38/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 38
5.3) GLUE 2.0 (OGF) permits to describe DCI entities
An Endpointaccepts anActivity(Job or Data) following
theAccessPolicydefined forthe UserDomain(VO, VRC)
containing the user havingsubmitted theActivity.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
39/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 39
5.3) GLUE 2.0 (OGF) permits to describe DCI entities
An Endpointaccepts anActivity(Job or Data) following
theAccessPolicydefined forthe UserDomain(VO, VRC)
containing the user havingsubmitted theActivity.
ThisActivityis allocated to the
Sharedefined by theMappingPolicyfor the
UserDomain, and transmittedto a Manager(Batch system)
for execution on a Resource(consistent set of physicalmachines).
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
40/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 40
5.3) GLUE 2.0 (OGF) permits to describe DCI entities
An Endpointaccepts anActivity(Job or Data) following
theAccessPolicydefined forthe UserDomain(VO, VRC)
containing the user havingsubmitted theActivity.
ThisActivityis allocated to the
Sharedefined by theMappingPolicyfor the
UserDomain, and transmittedto a Manager(Batch system)
for execution on a Resource(consistent set of physicalmachines).
A Service, managed inside anAdminDomain, aggregates a
consistent set of Endpoints,
Sharesand Managers.
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
41/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 41
5.3) GLUE 2.0 (OGF) permits to describe DCI entities
An Endpointaccepts anActivity(Job or Data) following
theAccessPolicydefined forthe UserDomain(VO, VRC)
containing the user havingsubmitted theActivity.
ThisActivityis allocated to the
Sharedefined by theMappingPolicyfor the
UserDomain, and transmittedto a Manager(Batch system)
for execution on a Resource(consistent set of physicalmachines).
A Service, managed inside anAdminDomain, aggregates a
consistent set of Endpoints,
Sharesand Managers.
The Serviceshould keeppersistentActivityRecords
(Logging and Accounting), butthis is notwell standardized yet. ActivityRecord
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
42/46
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
43/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 43
5.5) Standardization of Payload Functionalities
Data: GridFTP, ByteIO, SRM,
DMI (OGF), CDMI (SNIA)
Instrument: DORII (nota standard yet)
Cloud: OCCI (OGF)
Job: JSDL, OGSA-BES,HPC profiles (OGF)
These standards exist, but are notconsistent
between each other, nor consistent withsupport standards.
They should be consolidated using GLUE 2.0
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
44/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 44
5.6) Useful official and de facto standards for DCIs
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
45/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 45
5.7) Problem of the backend interfaces
For each functionality, it is necessary to standardize the client interface.
But interoperability between different DCIs requires usage of backendinterfaces permitting access to resources.
Then, these backend interfaces also have to be standardized.
For each functionality, I see only two solutions :
The standard describing the client interface also describes the backend interface
(this is heavy and monolithic)
We have to completely separate the definition of the data flowing through the
interface and the definition of the dialog protocol, and/or define a specific
standard for the backend interface (this generates a lot of standards to create
and maintain)
See next slide for the concept. See http://forge.gridforum.org/sf/go/doc15980?nav=1 for the complexity.
Could anybody propose a solution to this problem ?
8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs
46/46
Software Engineering for DistributedData Processing Infrastructures (DCIs)
Authors : E. Urbah, O. Lodygenskyv1.4 46
5.7) Problem of the backend interfaces