16
Active Data: Data Life Cycle Management Across Heterogeneous Systems and Infrastructures Anthony Simonet, Gilles Fedak (INRIA) Matei Ripeanu, Samer Al-Kiswany (UCB) Kyle Chard, Ian Foster (ANL/UC) Hot Topics in High-Performance Distributed Computing Workshop IBM Almaden Research Center San Jose, California March 12, 2015 1/13 G. Fedak() Active Data March 12, 2015

Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Embed Size (px)

DESCRIPTION

Active Data : Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures The Big Data challenge consists in managing, storing, analyzing and visualizing these huge and ever growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span over a large variety of devices and e-infrastructures which implies that many systems are involved in data management and processing. ''Active Data'' is new approach to automate and improve the expressiveness of data management applications. It consists of * a 'formal model' for Data Life Cycle, based on Petri Net, that allows to describe and expose data life cycle across heterogeneous systems and infrastructures. * a 'programming model' allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happen to any data.

Citation preview

Page 1: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Active Data: Data Life Cycle ManagementAcross Heterogeneous Systems and

Infrastructures

Anthony Simonet, Gilles Fedak (INRIA)Matei Ripeanu, Samer Al-Kiswany (UCB)

Kyle Chard, Ian Foster (ANL/UC)

Hot Topics in High-Performance Distributed Computing WorkshopIBM Almaden Research Center

San Jose, CaliforniaMarch 12, 2015

1/13

G. Fedak() Active Data March 12, 2015

Page 2: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Big Data ...

I Huge and growing volume of information originating from multiplesources.

!"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($)

I . . . or Big Bottlenecks ?I how to scale the infrastructure ?

I end-to-end performance improvement, inter-system optimization.I how to improve productivity of data-intensive scientist ?

I data-oriented programming language, data quality, improveautomation and errors recovery

2/13

G. Fedak() Active Data March 12, 2015

Page 3: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Data Life Cycle

Definition

Data Life Cycle (DLC) is the course of operational stages through whichdata pass from the time when they enter a set of systems to the timewhen they leave it.

!"#$%&%'()* +,-.,("-&&%)/* 01(,2/-*

!)234&%&*

!)234&%&*

Challenges :

I Expose high level view DLC across distributed systems andinfrastructures

I Expose interactions between the infrastructure and the DLC (e.gfailures)

3/13

G. Fedak() Active Data March 12, 2015

Page 4: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Active Data

Active Data:

I Allow to reason about data sets handled by heterogeneous softwareand infrastructures.

I A formal model that captures the essential life cycle stages andproperties: creation, deletion, faults, replication, error checking . . .

I programming model to develop easily data life cycle managementapplications.

I Allows legacy systems to expose their intrinsic data life cycle.

4/13

G. Fedak() Active Data March 12, 2015

Page 5: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Active Data: Principles & Features

System programmers expose their system’s internal data life cycle with amodel based on Petri Nets.A Life Cycle Model is made of

I Places: data states

I Transitions : data operations

•Created

t1

Written

t2

Read

t3

t4

Terminated

public void handler () {

computeMD5 ();

}

Each token has a unique identifier, corresponding to the actual dataitem’s.

5/13

G. Fedak() Active Data March 12, 2015

Page 6: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Active Data: Principles & Features

System programmers expose their system’s internal data life cycle with amodel based on Petri Nets.A Life Cycle Model is made of

I Places: data states

I Transitions : data operations

Created

t1

•Written

t2

Read

t3

t4

Terminated

public void handler () {

computeMD5 ();

}

A transition is fired whenever a data state changes.

5/13

G. Fedak() Active Data March 12, 2015

Page 7: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Active Data: Principles & Features

System programmers expose their system’s internal data life cycle with amodel based on Petri Nets.A Life Cycle Model is made of

I Places: data states

I Transitions : data operations

Created

t1

•Written

t2

Read

t3

t4

Terminated

public void handler () {

computeMD5 ();

}

Code may be plugged by clients to transitions.It is executed whenever the transition is fired.

5/13

G. Fedak() Active Data March 12, 2015

Page 8: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Active Data Framework

6/13

G. Fedak() Active Data March 12, 2015

Life Cycle View

File transferFile Dataset Metadata

Guard

Code Execution

} Tagged Tokens

Notification

Framework features:

I Captures data events in legacy systems

I High-level life cycle-centered view of dataI Single namespace for all the files,

datasets and metadata

I Powerful filters based on Data TagsI Install Taggers on TransitionsI Guarded Transitions : only executes on

token which have specific tags.

I Publish/subscribe transitions

I Custom user reaction to data progressI Custom code executionI Custom notifications (twitter, email,

gdoc, ifttt . . . )

Page 9: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Use Case: Advanced Photon Source

Globus Catalog Globus

Detector Local Storage Compute Cluster

1. LocalTransfer

2. ExtractMetadata

3. GlobusTransfer

4. Swift Parallel Analysis

I 3 to 5 TB of data per week on this detector

I Raw data are pre-processed and registered in the Globus Catalog :

I Data are curated by several applications

I Data are shared amongst scientific user

7/13

G. Fedak() Active Data March 12, 2015

Page 10: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Data Surveillance Framework

4 goals (that would otherwise require a lot of scripting and hacking):

I Monitoring Data Set Progress

I Better Automation

I Sharing & Notification

I Error Discovery & Recovery

8/13

G. Fedak() Active Data March 12, 2015

Page 11: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

APS Data Life Cycle Model

Created Start transfer

Terminated

End

Detector

Created

SuccessFailure

SucceededFailed

EndEnd

Terminated

End transfer

Globus transfer

Created

End

Terminated

Start transfer

Shared storage

Created

SuccessFailure

SucceededFailed

EndEnd

Terminated

Start Swift

Globus transfer

CreatedExtract

Update

TerminatedRemove

Globus Catalog

Created

Initialize

Set

End

Failure

Terminated

Derive

Swift

Data life cycle model composed of 6 systems.

9/13

G. Fedak() Active Data March 12, 2015

Page 12: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Error Detection & Recovery

10/13

G. Fedak() Active Data March 12, 2015

Example scenario

Recover from system-wide errors: faulty acquired files are detected onlyafter Swift fails to process them.

In this situation, the user manually:

I Drops the whole dataset

I Removes any associated file and metadata

I Re-acquire the dataset using the same parameters

Page 13: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent

Use-case: APS data life cycle model

Created Start transfer

Terminated

End

Detector

Created

SuccessFailure

SucceededFailed

EndEnd

Terminated

End transfer

Globus transfer

Created

End

Terminated

Start transfer

Shared storage

Created

SuccessFailure

SucceededFailed

EndEnd

Terminated

Start Swift

Globus transfer

CreatedExtract

Update

TerminatedRemove

Globus Catalog

Created

Initialize

Set

End

Failure

Terminated

Derive

Swift

Data life cycle model composed of 6 systems.

Avalon June 3rd, 2014 22/30

Active Data ClientFailure and Recovery Handler

remove metadata from the catalog

Filters data likely to fail

Tagger

Active Data Client

Guard

Handler

run the Globus catalog UI scripts

11/13

G. Fedak() Active Data March 12, 2015

Page 14: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Handler Code

TransitionHandler handler = new TransitionHandler () {

public void handler(Transition t, boolean isLocal , Token[] inTokens , Token[] outTokens) {

// Get the dataset identifier

LifeCycle lc = ad.getLifeCycle(inTokens [0]);

datasetId = lc.getTokens("Shared storage.Created")[0]. getUid ();

// Remove the dataset annotations from the catalog

String url = "https :// catalog.globus.org/dataset/" + datasetId;

Runtime r = Runtime.getRuntime ();

Process p = r.exec("catalog_client.py remove " + url);

p.waitFor ();

// Locally , remove the datasets

String path = "~/aps/" + datasetId;

FileUtils.deleteDirectory(new File(path));

// Publish the "Detector.End"

Token root = lc.getTokens("Detector.Created")[0];

ad.publishTransition("Detector.End", lc);

// Notify the user

sendEmail("[email protected]", "APS - Corrupted dataset " + datasetId);

}

};

HandlerGuard guard = new HandlerGuard () {

public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) {

return input [0]. hasTag(" f a i l u r e c o r r u p t e d ");

}}

ad.subscribeTo("Swift.Failure", handler , guard);

12/13

G. Fedak() Active Data March 12, 2015

Page 15: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Conclusion

Active Data

I allows to expose Data Life Cycle across heterogeneous systems andinfrastructures

I transition-based programming model for DLC managementapplication

I Monitoring, automation, error detection & recoveryI X-systems optimizations: incremental computing, data staging,

caching, throttling etc. . .

Perspectives :

I Use AD to deploy data management software stack on IaaS (AsmaBen Cheick, Heithem Abbes, Univ. Tunis)

I Big Data Apache stack X-optimization (H. He, CAS, Beijing)

I Volunteer & crowd computing (M. Moca, BBU, Romania)

13/13

G. Fedak() Active Data March 12, 2015

Page 16: Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

Thank you!

Questions?