Software Engineering for Distributed Data Processing Infrastructures DCIs

  • 8/11/2019 Software Engineering for Distributed Data Processing Infrastructures DCIs

    1/46

    The EDGI project receives Community research funding under grant agreement 261556

    Software Engineering for Distributed

    Data Processing Infrastructures (DCIs)

    Etienne URBAH, O. LODYGENSKY

    LAL, Univ Paris-Sud, IN2P3/CNRS, Orsay, France

    Authors : E. Urbah, O. Lodygensky (v1.4)

    Summary

    1) Distributed Data Processing

    2) Best practices - Software engineering

    3) DCI = Federation of Administrative Domains

    4) Middleware Stacks for Distributed Data Processing

    5) Standards for Distributed Data Processing

    1) Distributed Data Processing

    1.1) Scope of Data Processing

    Data processing is much more than computing :

    Data acquisition

    Instrument management

    Data, Metadata management

    Access rights to data

    Computing

    Filtering, Sorting

    Graphical display

    Long term curation, ...

    1.2) Data Processing on a User Device

    On any single user device (PC, Mobile phone, ...) :

    Access rights are very simple,

    But storage and computing power are limited.

    1.3) Data Processing needs to be distributed

    Depending on the workflow of data processing, infrastructure types are more or less appropriate, with a tradeoff between development effort, performance and cost :

    Costly supercomputers are mandatory for workflows requiring huge core memory and/or highly coupled parallelism. Development effort is high for efficient highly coupled parallelism.

    DCIs such as EGI and OSG are optimal for secure data sharing, and for workflows using large data sets with medium core memory and small parallelism. Development effort is low for single applications.

    Desktop grids are optimal for workflows requiring heavy uncoupled parallelism, moderate core memory and moderate data sets (< 100 MB). In the near future, desktop grids will also permit secure storage of data sets of any size, but transfer rate will stay limited. Development effort for single applications highly depends on the grid middleware (high for BOINC, low for XtremWeb).
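
    The tradeoff above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Workflow` fields, the `Level` scale and the function name are hypothetical, and the only numeric bound taken from the slide is the 100 MB data set size for desktop grids.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Hypothetical qualitative scale for core memory needs."""
    SMALL = 1
    MODERATE = 2
    HUGE = 3


@dataclass(frozen=True)
class Workflow:
    """Hypothetical description of a workflow; field names are illustrative."""
    core_memory: Level          # memory needed per core
    coupled_parallelism: bool   # True if tasks are highly coupled
    dataset_mb: float           # size of the data set, in MB
    needs_secure_sharing: bool  # True if data must be shared securely


def suggest_infrastructure(w: Workflow) -> str:
    """Return a suggested infrastructure type following the tradeoff above."""
    # Huge core memory and/or highly coupled parallelism: supercomputer.
    if w.core_memory is Level.HUGE or w.coupled_parallelism:
        return "supercomputer"
    # Secure sharing, or data sets at or above the 100 MB bound: DCI.
    if w.needs_secure_sharing or w.dataset_mb >= 100:
        return "DCI"
    # Uncoupled parallelism, moderate memory, data sets < 100 MB: desktop grid.
    return "desktop grid"
```

    Development effort and cost are deliberately left out of this sketch, since the slide only ranks them qualitatively.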

    2) Best practices - Software engineering

    WHO ?

    WHAT ?

    WHEN ?

    WHERE ?

    HOW ?

    2.1) Bad practices to avoid

    In order to harness the complexity of distributed data processing:

    Beware of buzzwords

    Avoid meaningless pretty pictures

    Passive sentences should also be avoided : Each of us is responsible for avoiding passive sentences.

    Beware of Clouds

    Beware of Virtualization

    2.2) Best practices - Software engineering

    In order to harness the complexity of distributed data processing:

    Best practices and Software engineering already exist :

    UML http://www.uml.org/

    CMMI-DEV http://www.sei.cmu.edu/library/abstracts/reports/10tr033.cfm

    ITIL http://www.itil-officialsite.com/AboutITIL/WhatisITIL.aspx

    etc

    2.3) Best practices - Simple questions

    2.4) Software engineering - UML

    2.5) Software engineering - First steps

    Using Best practices and Software engineering to harness the complexity of distributed data processing:

    Clearly distinguish and describe relationships between concepts

    An extensive glossary is available at http://www.ogf.org/documents/GFD.181.pdf

    A use case collection is available at http://www.ogf.org/documents/GFD.180.pdf

    Robustness is an intrinsic property of a resource.

    Virtualization permits flexible on demand management of underlying resources, but only if these underlying resources are already robust, and at the cost of the virtualization layer, which may take months to acquire robustness.

    Administrative domains contain a repository permitting easy authentication and authorization of users.

    Grid computing is a federation of administrative domains, based on mutual trust, some common operational setup (in particular the trust anchor for authentication and authorization), and specialized interoperable middleware stacks. It permits secure data sharing.

    Cloud Computing is the on demand management (allocation, instantiation and usage) of internal or external computing resources inside a given administrative domain; it currently does not span multiple administrative domains.

    Interoperability between clouds does not exist yet, as it requires some common cloud operational setup and interoperable cloud middleware stacks. Work is ongoing on federation of clouds, for instance in http://contrail-project.eu

    2.6) Software engineering - Users

    End Users of DCIs are scientists, with various ICT, Grid and Cloud knowledge. For example :

    Application developers,

    Experienced application users,

    Scientists with no ICT knowledge using a scientific portal,

    ...

    Direct Users of DCIs are various :

    Developers of scientific applications,

    Integrators of scientific applications for grids,

    Providers of scientific workflow engines,

    Providers of scientific portals,

    Providers of SAGA,

    Site Administrators,

    VO Administrators,

    In order to assess User Needs, take care to track all Users:

    Users having ICT knowledge are able to express, criticize and validate sound needs and requirements.

    3) DCI = Federation of Administrative Domains

    3.1) Administrative Domain Principle

    Each organization (University, Research Institute, Administration, Enterprise, ...) has in general an Administrative Domain managed by an operational team and protected by a firewall.

    This Administrative Domain permits each registered user to use any shared resource of the Administrative Domain following defined access rights.
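
    The principle above can be sketched as a toy model. It is a minimal illustration, not a description of any real middleware: the class, its attributes and its methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AdministrativeDomain:
    """Toy model of one Administrative Domain (all names are illustrative)."""
    name: str
    registered_users: Set[str] = field(default_factory=set)
    # Maps a shared resource name to the set of users allowed to use it.
    access_rights: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, user: str) -> None:
        """Add a user to the repository of the domain."""
        self.registered_users.add(user)

    def grant(self, user: str, resource: str) -> None:
        """Define an access right; only registered users can receive one."""
        if user not in self.registered_users:
            raise ValueError(f"{user} is not registered in {self.name}")
        self.access_rights.setdefault(resource, set()).add(user)

    def can_use(self, user: str, resource: str) -> bool:
        """A user may use a resource only if registered and granted access."""
        return (user in self.registered_users
                and user in self.access_rights.get(resource, set()))
```

    For example, after `register("alice")` and `grant("alice", "batch")`, `can_use("alice", "batch")` is true, while any unregistered user is refused.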

    3.2) Payload and Support Functionalities

    The end user is primarily interested in 'payload' functionalities :

    Data Acquisition

    Instrument Management

    Data, Metadata Management

    Computing

    But management of the shared resources of an Administrative Domain also requires 'support' functionalities :

    Security

    Information (Publication of resource capabilities so that they can be dynamically discovered)

    Monitoring

    Logging and Bookkeeping

    Accounting

    These functionalities are provided more or less completely and consistently by access and sharing middleware.

    3.3) Administrative Domain : Access and sharing methods

    Inside an Administrative Domain, any access or sharing middleware may be used :

    BitDew

    CORBA

    File server

    Database server

    Remote login (SSH, RDP)

    Batch system

    Applicative portal

    Private grid (Condor, BOINC, XtremWeb, ...)

    Private cloud

    ...

    Standardization is optional.

    3.4) Clouds provide on-demand resources

    A cloud provides on-demand resources with a rich and consistent set of data processing functionalities, and a powerful and consistent interface, but limited to ONE cloud provider (no interoperability).

    It permits flexible extension of an Administrative Domain.

    Cost of a commercial cloud is affordable for permanent data storage and computing, but very high for the transfer of large amounts of data.

    Currently, clouds do NOT manage federation of Administrative Domains. Work is ongoing on federation of clouds, for instance in http://contrail-project.eu

    (Diagram labels: Commercial Cloud, Private Cloud)

    3.5) DCI = Federation of Administrative Domains

    The goal is sharing of data.

    The principle is mutual trust.

    The method is agreement on common middleware stacks and operational procedures.

    Resources stay locally managed.

    The middleware provides consistent access rights to all users according to their DCI credentials.

    Some scientific, academic and research organizations, which already own data related resources, increasingly need to securely share these resources with others.

    In order to federate their private resources into a production infrastructure, these organizations have had to establish mutual trust and adopt compatible middleware stacks and procedures, integrated through operations teams, to bring their resources together into a Distributed Computing Infrastructure (DCI).

    The recurring feature of the various DCIs that offer production resources (e.g. EGI, DEISA, PRACE, etc.) is that each one integrates multiple locally managed administrative domains into a usable environment.

    The middleware deployed by each DCI provides its users, according to their DCI credentials, consistent access rights to all resources managed by that DCI.

    4) Middleware Stacks for Distributed Data Processing

    4.1) Various DCI middleware stacks

    4.2) DCI middleware stacks - Poor interoperability

    This diagram was created by Morris RIEDEL (Jülich Supercomputing Centre)

    Each area in white is a grid infrastructure, now called DCI.

    Interoperability, as shown in green and red, is desired.

    But it is not achieved yet.

    5) Standards for Distributed Data Processing

    5.1) Identity of a User on a DCI

    Each DCI spreads across several Administrative Domains.

    So (login, password) does NOT work anymore.

    Each user must be identified by a certificate.

    Each middleware component using (login, password) must be adapted.

    In the USA, the InCommon federation provides SSL certificates.

    X509 certificates are the most robust :

    Such certificates are provided to a user, after approval of his/her organization, by a Certificate Authority.

    Each Administrative Domain must maintain a copy of the list of Certificate Authorities.

    The list of CAs published by IGTF (International Grid Trust Federation) is a de facto standard.
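
    The role of the shared list of Certificate Authorities can be sketched as follows. This is a simplified illustration only: `Certificate` and `AdministrativeDomain` are hypothetical classes, the distinguished names are made up, and no real X509 parsing or signature verification is performed.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Certificate:
    """Simplified stand-in for an X509 certificate (only two fields kept)."""
    subject: str  # distinguished name of the user
    issuer: str   # distinguished name of the issuing Certificate Authority


class AdministrativeDomain:
    """Hypothetical domain holding its own copy of the trusted CA list."""

    def __init__(self, name: str, trusted_cas: FrozenSet[str]):
        self.name = name
        # Local copy of the list of trusted Certificate Authorities.
        self.trusted_cas = trusted_cas

    def accepts(self, cert: Certificate) -> bool:
        """Accept a user only if a trusted CA issued the certificate.

        A real implementation would also verify the signature, the validity
        period and revocation lists; this sketch only checks the issuer name.
        """
        return cert.issuer in self.trusted_cas
```

    Two domains holding the same copy of the CA list accept the same users, which is what makes a common published list useful across a DCI.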

    5.2) Security Layer of DCI Middleware Stacks

    If you are lucky, all software components using (login, password) are maintained in a consistent security layer.

    This security layer must now handle X509 certificates.

    This permits any user, depending on his/her access rights, to use shared resources or share his/her own ones. In particular, he/she can submit jobs (tasks, activities) on remote resources.

    But a job executed on a remote resource often has to process data in another location, with the access rights of the job submitter.

    This is a complicated problem called delegation, which the security layer also has to solve.

    Each instance of this security layer must work consistently in each of the various Administrative Domains of one given DCI, and this requires very careful standardization.
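
    The delegation problem described above can be sketched with a toy credential chain. This is an assumption-laden illustration, not the mechanism of any particular middleware: `Credential`, `delegate`, `acting_user` and `may_read` are hypothetical names, and no cryptography is involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    """Toy credential: either a user's own one or one delegated from it."""
    holder: str                              # who carries this credential
    expires_at: float                        # expiry time, in seconds
    delegated_from: Optional["Credential"] = None

    def delegate(self, to: str, lifetime: float, now: float) -> "Credential":
        """Create a short-lived credential acting on behalf of this one."""
        expiry = min(self.expires_at, now + lifetime)
        return Credential(holder=to, expires_at=expiry, delegated_from=self)

    def acting_user(self) -> str:
        """Walk the delegation chain back to the original submitter."""
        cred = self
        while cred.delegated_from is not None:
            cred = cred.delegated_from
        return cred.holder

    def is_valid(self, now: float) -> bool:
        """Valid only if no credential in the chain has expired."""
        cred: Optional[Credential] = self
        while cred is not None:
            if now >= cred.expires_at:
                return False
            cred = cred.delegated_from
        return True


def may_read(cred: Credential, allowed_users: set, now: float) -> bool:
    """Remote data service: apply the access rights of the job submitter."""
    return cred.is_valid(now) and cred.acting_user() in allowed_users
```

    The job carries the delegated credential to the remote data location, where access is decided on the rights of the original submitter rather than on the identity of the job.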

    5.3) GLUE 2.0 (OGF) permits the description of DCI entities

    An Endpoint accepts an Activity (Job or Data) following the AccessPolicy defined for the UserDomain (VO, VRC) containing the user having submitted the Activity.

    This Activity is allocated to the Share defined by the MappingPolicy for the UserDomain, and transmitted to a Manager (Batch system) for execution on a Resource (consistent set of physical machines).

    A Service, managed inside an AdminDomain, aggregates a consistent set of Endpoints, Shares and Managers.

    The Service should keep persistent ActivityRecords (Logging and Accounting), but this is not well standardized yet.
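
    The relationships between these entities can be sketched with a few data classes. The entity names come from the slide, but the attribute names and the placement of both policies on the Endpoint are simplifications made for this illustration, not the GLUE 2.0 schema itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Manager:
    """Stands for a batch system driving a Resource (set of machines)."""
    name: str
    resource: str
    queue: List[str] = field(default_factory=list)


@dataclass
class Share:
    """Part of the capacity, served by one Manager."""
    name: str
    manager: Manager


@dataclass
class Endpoint:
    url: str
    # AccessPolicy: UserDomains whose users may submit to this Endpoint.
    access_policy: Set[str]
    # MappingPolicy: UserDomain -> Share receiving its Activities.
    mapping_policy: Dict[str, Share]

    def accept(self, activity_id: str, user_domain: str) -> Share:
        """Apply AccessPolicy, then MappingPolicy, then hand over to the Manager."""
        if user_domain not in self.access_policy:
            raise PermissionError(f"{user_domain} is not allowed on {self.url}")
        share = self.mapping_policy[user_domain]
        share.manager.queue.append(activity_id)
        return share


@dataclass
class Service:
    """Aggregates Endpoints, Shares and Managers inside one AdminDomain."""
    admin_domain: str
    endpoints: List[Endpoint]
    shares: List[Share]
    managers: List[Manager]
```

    ActivityRecords are left out of the sketch, in line with the remark above that they are not well standardized yet.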

    5.5) Standardization of Payload Functionalities

    Data: GridFTP, ByteIO, SRM, DMI (OGF), CDMI (SNIA)

    Instrument: DORII (not a standard yet)

    Cloud: OCCI (OGF)

    Job: JSDL, OGSA-BES, HPC profiles (OGF)

    These standards exist, but are not consistent with each other, nor consistent with support standards.

    They should be consolidated using GLUE 2.0

    5.6) Useful official and de facto standards for DCIs

    5.7) Problem of the backend interfaces

    For each functionality, it is necessary to standardize the client interface.

    But interoperability between different DCIs requires usage of backend interfaces permitting access to resources.

    Then, these backend interfaces also have to be standardized.

    For each functionality, I see only two solutions :

    The standard describing the client interface also describes the backend interface (this is heavy and monolithic)

    We have to completely separate the definition of the data flowing through the interface and the definition of the dialog protocol, and/or define a specific standard for the backend interface (this generates a lot of standards to create and maintain)

    See next slide for the concept.

    See http://forge.gridforum.org/sf/go/doc15980?nav=1 for the complexity.

    Could anybody propose a solution to this problem ?
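
    The second solution listed above (separating the definition of the data from the definition of the dialog protocol) can be sketched as follows. This is only one possible reading, not a proposed standard: `JobDescription`, `DialogProtocol`, `JsonDialog` and `LineDialog` are hypothetical names, and both wire formats are invented for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Protocol


@dataclass(frozen=True)
class JobDescription:
    """Data flowing through the interface, defined once (fields are illustrative)."""
    executable: str
    arguments: tuple
    cpu_count: int


class DialogProtocol(Protocol):
    """Dialog protocol, defined separately from the data it carries."""
    def encode(self, job: JobDescription) -> bytes: ...


class JsonDialog:
    """Hypothetical client-side dialog carrying the data as JSON."""
    def encode(self, job: JobDescription) -> bytes:
        return json.dumps(asdict(job), sort_keys=True).encode("utf-8")


class LineDialog:
    """Hypothetical backend-side dialog carrying the same data as one text line."""
    def encode(self, job: JobDescription) -> bytes:
        args = " ".join(job.arguments)
        return f"RUN {job.cpu_count} {job.executable} {args}\n".encode("utf-8")


def submit(job: JobDescription, dialog: DialogProtocol) -> bytes:
    """Client and backend share JobDescription; only the dialog differs."""
    return dialog.encode(job)
```

    With this split, the client interface and the backend interface reuse one data definition, and each new dialog only adds an encoder, at the cost noted above of more definitions to create and maintain.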

    5.7) Problem of the backend interfaces