WORKS08, Austin, Texas, November 17th, 2008
Monitoring Infrastructure for Grid Scientific Workflows
Bartosz Baliś, Marian Bubak
Institute of Computer Science and ACC CYFRONET AGH, Kraków, Poland



Outline

• Challenges in Monitoring of Grid Scientific Workflows

• GEMINI infrastructure

• Event model for workflow execution monitoring

• On-line workflow monitoring support

• Information model for recording workflow executions


Motivation

• Monitoring of Grid Scientific Workflows is important in many scenarios
  - On-line & off-line performance analysis, dynamic resource reconfiguration, on-line steering, performance optimization, provenance tracking, experiment mining, experiment repetition, …

• Consumers of monitoring data: humans (provenance) and processes

• On-line & off-line scenarios

• Historic records: provenance, retrospective analysis (enhancement of next executions)


Grid Scientific Workflows

• Traditional scientific applications
  - Parallel
  - Homogeneous
  - Tightly coupled

• Scientific workflows
  - Distributed
  - Heterogeneous
  - Loosely coupled
  - Legacy applications often in the backends
  - Grid environment

Challenges for monitoring arise


Challenges

• Monitoring infrastructure that conceals workflow heterogeneity
  - Event subscription and instrumentation requests

• Standardized event model for Grid workflow execution
  - Currently events tightly coupled to workflow environments

• On-line monitoring support
  - Existing Grid information systems not suitable for fast notification-based discovery

• Monitoring information model to record executions


GEMINI: monitoring infrastructure

• Standardized, abstract interfaces for subscription and instrumentation

• Complex Event Processing: subscription management via continuous querying

• Event representation
  - XML: self-describing, extensible, but poor performance
  - Google protocol buffers: under investigation

• Monitors: query & subscription engine, event caching, services

• Sensors: lightweight collectors of events

• Mutators: manipulation of monitored entities (e.g. dynamic instrumentation)


Outline

• Event model for workflow execution monitoring

• On-line workflow monitoring support

• Information model for recording workflow executions


Workflow execution events

• Motivation: capture commonly used monitoring measurements concerning workflow execution

• Attempts to standardize monitoring events exist, but they are oriented to resource monitoring
  - GGF DAMED ‘Top N’
  - GGF NMWG Network Performance Characteristics

• Typically monitoring systems introduce a single event type for application events


Workflow Execution Events – taxonomy

• Extension of GGF DAMED Top N events

• Extensible hierarchy; example extensions:
  - Loop entered – started.codeRegion.loop
  - MPI app invocation – invoking.application.MPI
  - MPI calls – started.codeRegion.call.MPISend
  - Application-specific events

• Events for initiators and performers
  - Invoking, invoked; started, finished

• Events for various execution levels
  - Workflow, task, code region, data operations

• Events for various execution states
  - Failed, suspended, resumed, …

• Events for execution metrics
  - Progress, rate

[Figure: event taxonomy tree. Node labels: software, execution, status, availability, invoking, invoked, started, finished, failed, changed, suspended, resumed, progress, rate, workflow, wfTask, application, process, codeRegion, call, dataRead, dataWrite]
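
The dotted names in the taxonomy (for example started.codeRegion.loop) lend themselves to per-segment prefix matching. The following is a minimal sketch, assuming that a subscription to an event type also covers its extensions; the slides do not state this rule, and the function name `matches` is hypothetical.

```python
def matches(subscribed: str, event_type: str) -> bool:
    """Return True if event_type equals subscribed or extends it.

    Types are dotted paths such as "started.codeRegion.loop". Matching is
    done per path segment, so "started.code" does not match
    "started.codeRegion".
    """
    sub_parts = subscribed.split(".")
    evt_parts = event_type.split(".")
    return evt_parts[: len(sub_parts)] == sub_parts


# Example event types taken from the taxonomy slide.
events = [
    "started.codeRegion.loop",
    "invoking.application.MPI",
    "started.codeRegion.call.MPISend",
    "finished.workflow",
]

# Select every event that is a "started.codeRegion" or one of its extensions.
code_region_starts = [e for e in events if matches("started.codeRegion", e)]
```

Matching on whole segments rather than on raw string prefixes keeps unrelated types with a shared spelling prefix apart.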


Outline

• Event model for workflow execution monitoring

• On-line workflow monitoring support

• Information model for recording workflow executions


On-line Monitoring of Grid Workflows

• Motivation
  - Reaction to time-varying resource availability and application demands
  - Up-to-date execution status

• Typical scenario: ‘subscribe to all execution events related to workflow Wf_1234’
  - Distributed producers, not known a priori

• Prerequisite: automatic resource discovery of workflow components
  - New producers are automatically discovered and transparently receive appropriate active subscription requests


Resource discovery in workflow monitoring

• Challenge: complex execution life cycle of a Grid workflow
  - Abstract workflows: mapping of tasks to resources at runtime
  - Many services involved: enactment engines, resource brokers, schedulers, queue managers, execution managers, …
  - No single place to subscribe for notifications about new workflow components
  - Discovery for monitoring must proceed bottom-up: (1) local discovery, (2) global advertisement, (3) global discovery

• Problem: current Grid information services are not suitable
  - Oriented towards query performance
  - Slow propagation of resource status changes
  - Example: average delay from event occurrence to notification in EGEE infrastructure ~ 200 seconds (Berger et al., Analysis of Overhead and Waiting Times in the EGEE production Grid)


Resource discovery: solution

• What kind of resource discovery is required?
  - Identity-based, not attribute-based
  - Full-blown information service functionality not needed
  - Just a simple, efficient key-value store

• Solution: a DHT infrastructure federated with the monitoring infrastructure to store shared state of monitoring services
  - Key = workflow identifier
  - Value = producer record (Monitoring service URL, etc.)
  - Multiple values (= producers) can be registered

• Efficient key-value stores
  - OpenDHT
  - Amazon Dynamo: efficiency, high availability, scalability. Lack of strong data consistency (‘eventual consistency’). Avg get/put delay ~ 15/30 ms; 99th percentile ~ 200/300 ms (DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store)
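
The key-value layout described on this slide (workflow identifier as key, several producer records as values) can be illustrated with an in-memory stand-in. This is only a sketch: a real DHT is distributed and eventually consistent, and the class names, the `monitor_url` field and the example URLs are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProducerRecord:
    """Hypothetical producer record; the slides name only the service URL."""
    monitor_url: str


class ProducerRegistry:
    """In-memory stand-in for the DHT: one key maps to many values."""

    def __init__(self):
        self._store = defaultdict(set)

    def put(self, workflow_id, record):
        # Registering the same producer twice is harmless (set semantics).
        self._store[workflow_id].add(record)

    def get(self, workflow_id):
        # Unknown workflows yield an empty set rather than an error.
        return set(self._store.get(workflow_id, ()))


registry = ProducerRegistry()
registry.put("wf1", ProducerRecord("http://site-a.example/monitor"))
registry.put("wf1", ProducerRecord("http://site-b.example/monitor"))
producers = registry.get("wf1")
```

Because lookups are by identity (the workflow identifier) only, no attribute queries or indexes are needed, which is what makes a plain key-value store sufficient here.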


Monitoring + DHT (simplified architecture)

[Figure: simplified architecture. Components: Consumer, Workflow Enactment Engine (workflow plan), DHT, and Sites A, B and C, each with a Workflow Task and a Monitor. Labeled interactions: subscribe (wf1), monitoring events.]


DHT-based scenario

[Figure: sequence diagram. Participants: C0 : Client, M0 : Monitor, M1 : Monitor, M2 : Monitor, B0 : DHT. Messages: put(wf1,m1), put(wf1,m2), get(wf1) returning m1 and later m1,m2, subscribe(wf1), subscribe(wf1,C0), push(event), push(archiveEvents).]
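
The message labels in this scenario can be read as the following flow: producers advertise themselves with put, the monitor contacted by the client looks producers up with get, forwards the subscription, and repeats the lookup to pick up producers that appear later. The sketch below follows that reading; the exact message order is not recoverable from the slide, a plain dict stands in for the DHT, and all class and method names are hypothetical.

```python
class Monitor:
    """Toy monitor; names and methods are illustrative only."""

    def __init__(self, name, dht):
        self.name = name
        self.dht = dht          # shared dict: workflow id -> producer monitors
        self.consumers = {}     # producer side: workflow id -> consumer callbacks
        self.pending = {}       # entry side: workflow id -> active subscriptions
        self.forwarded = set()  # (workflow id, producer name, consumer id)

    def advertise(self, wf_id):
        # put(wf, m): register this monitor as a producer for the workflow.
        self.dht.setdefault(wf_id, []).append(self)

    def subscribe(self, wf_id, consumer):
        # Producer side: remember who wants events for this workflow.
        self.consumers.setdefault(wf_id, []).append(consumer)

    def accept(self, wf_id, consumer):
        # Entry side: keep the request active and discover current producers.
        self.pending.setdefault(wf_id, []).append(consumer)
        self.poll(wf_id)

    def poll(self, wf_id):
        # get(wf): forward active subscriptions to producers not yet covered.
        for producer in self.dht.get(wf_id, []):
            for consumer in self.pending.get(wf_id, []):
                key = (wf_id, producer.name, id(consumer))
                if key not in self.forwarded:
                    self.forwarded.add(key)
                    producer.subscribe(wf_id, consumer)

    def push(self, wf_id, event):
        # push(event): deliver directly to every subscribed consumer.
        for consumer in self.consumers.get(wf_id, []):
            consumer((self.name, event))


dht = {}
m0, m1, m2 = (Monitor(n, dht) for n in ("m0", "m1", "m2"))
received = []

m1.advertise("wf1")                # put(wf1, m1)
m0.accept("wf1", received.append)  # client subscribes via m0; get(wf1) finds m1
m1.push("wf1", "started.wfTask")
m2.advertise("wf1")                # a new producer appears later: put(wf1, m2)
m0.poll("wf1")                     # get(wf1) finds m1 and m2; m2 is subscribed
m2.push("wf1", "started.wfTask")
```

The consumer never learns about producers itself; it subscribes once, and the entry monitor keeps the subscription active for producers discovered afterwards.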


Evaluation

• Goal:
  - Measure performance & scalability
  - Comparison with centralized approach

• Main characteristic measured:
  - Delay between the occurrence of a new workflow component and the beginning of data transfer, for different workloads

• Two methodologies:
  - Queuing Network models with multiple classes, analytical solution
  - Simulation models (CSIM simulation package)


1st methodology: Queuing Networks

[Figure: queuing network notation (customers arriving, customers leaving, queue, server; load-independent, load-dependent and delay resources) and two models. (a) DHT solution QN model: nodes M1, M2, …, Mm, C1, C2, …, Cn and DHT; labels poll, get, put, subscribe, register, begin transfer. (b) Centralized solution QN model: nodes M1, M2, …, Mm, C1, C2, …, Cn and CR; labels register, I’m producer, get main producer, subscribe, begin transfer.]

• Solved analytically
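
As a reminder of what an analytical solution looks like, the standard result for an open, single-class queuing network gives, for a load-independent center with service demand D and arrival rate λ, utilization U = λD and residence time D / (1 − U); a delay center simply contributes D. The sketch below applies that formula. It is single-class for brevity (the talk uses multi-class models), and the service demands are hypothetical values, not the measured ones.

```python
def response_time(arrival_rate, queueing_demands, delay_demands=()):
    """Mean response time of an open, single-class queuing network.

    arrival_rate      -- customers per second (lambda)
    queueing_demands  -- service demand in seconds at each load-independent center
    delay_demands     -- service demand in seconds at each delay center
    """
    total = sum(delay_demands)
    for demand in queueing_demands:
        utilization = arrival_rate * demand
        if utilization >= 1.0:
            raise ValueError("center saturated: utilization >= 1")
        # Residence time grows without bound as utilization approaches 1.
        total += demand / (1.0 - utilization)
    return total


# Hypothetical service demands (seconds); not the measured values from the talk.
demands = [0.05, 0.02]
network_delay = [0.03]
for rate in (0.3, 1.0, 10.0):
    print(rate, round(response_time(rate, demands, network_delay), 4))
```

The center with the largest demand saturates first, at an arrival rate of 1 / D, which is how such a model exposes the scalability limit of a centralized component.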


2nd methodology: discrete-event simulation

• CSIM simulation package


Input parameters for models

• Workload intensity
  - Measured in job arrivals per second
  - Taken from EGEE (large-scale production infrastructure): 3,000 to 100,000 jobs per day
  - Assumed range: from 0.3 to 10 job arrivals per second

• Service demands
  - Monitors and Coordinator: prototypes built and measured
  - DHT: from available reports on large-scale deployments (OpenDHT, Amazon Dynamo)


Service demand matrices


Results (centralized model)


Results (DHT model)


Scalability comparison: centralized vs. DHT

Conclusion: the DHT solution is scalable, as expected, but the centralized solution can still handle relatively large workloads before saturation


Outline

• Event model for workflow execution monitoring

• On-line workflow monitoring support

• Information model for recording workflow executions


Information model for wf execution records

• Motivation: need for structured information about past experiments executed as scientific workflows in e-Science environments
  - Provenance querying
  - Mining over past experiments
  - Experiment repetition
  - Execution optimization based on history

• State of the art
  - Monitoring information models do exist but for resource monitoring (GLUE), not execution monitoring
  - Provenance models are not sufficient
  - Repositories for performance data are oriented towards event traces or simple performance-oriented information

• Experiment Information (ExpInfo) model
  - Ontologies used to describe the model and represent the records


ExpInfo model

[Figure: ExpInfo model, a simplified example with particular domain ontologies. Ontology labels: Generic in silico experiment ontology, Computation ontology, ViroLab computation ontology, DRS Data ontology. Class labels with the attribute lists printed next to them: Experiment (name, executedBy, sourceFile, id, version, time, duration), ExecutionStage (time, duration), Computation (operationName, cpuUsage, memoryUsage), DataAccess (query), ExperimentResult (storageMechanism, time, format, size, creator), GridObject (meaning, name), GridObjectImplementation (technology), GridObjectInstance (endpoint), DataEntity, NewDrugRanking (region, usedRuleSet), NucleotideSequenceAlignment (alignmentRegion), NucleotideSequenceSubtyping (resultSubtype), VirusNucleotideSequence, VirusNucleotideMutation, DrugRanking. Association labels: inputData, outputData, usedInstance, obtainedResult, testedMutation, resultRanking, resultMutation, alignedNs, subtypedNs.]
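
To make the generic part of the model more concrete, the class and attribute labels from the figure can be written down as plain data classes. This is a loose sketch, not the ontology itself: the pairing of attributes with classes, the inheritance of Computation and DataAccess from ExecutionStage, the `stages` list and all field types are assumptions read off the figure labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionStage:
    time: float       # start time, assumed to be seconds since some epoch
    duration: float   # assumed to be seconds


@dataclass
class Computation(ExecutionStage):
    operation_name: str = ""
    cpu_usage: float = 0.0
    memory_usage: float = 0.0


@dataclass
class DataAccess(ExecutionStage):
    query: str = ""


@dataclass
class Experiment:
    id: str
    name: str
    executed_by: str
    time: float
    duration: float
    # Assumed one-to-many association between an experiment and its stages.
    stages: List[ExecutionStage] = field(default_factory=list)


experiment = Experiment(id="exp-1", name="demo", executed_by="user",
                        time=0.0, duration=12.0)
experiment.stages.append(Computation(time=0.0, duration=7.5,
                                     operation_name="align"))
experiment.stages.append(DataAccess(time=7.5, duration=2.5,
                                    query="select sequences"))
time_in_stages = sum(stage.duration for stage in experiment.stages)
```

In the talk the records are represented with ontologies rather than program classes; the sketch only shows which pieces of information a single record carries.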


ExpInfo model: set of ontologies

• General experiment information
  - Purpose, execution stages, input/output data sets

• Provenance information
  - Who, where, why, data dependencies

• Performance information
  - Duration of computation stages, scheduling, queueing, performance metrics (possible)

• Resource information
  - Physical resources (hosts, containers) used in the computation

• Connection with domain ontologies
  - Data sets with Data ontology
  - Execution stages with Application ontology


Aggregation of data to information

• From monitoring events to ExpInfo records

• Standardized process described by aggregation rules and derivation rules

• Aggregation rules specify how to instantiate individuals
  - Ontology classes are associated with aggregation rules through object properties

• Derivation rules specify how to compute attributes, including object properties (= associations between individuals)
  - Attributes are associated with derivation rules via annotations

• Semantic Aggregator collects wf execution events and produces ExpInfo records according to aggregation and derivation rules


Aggregation rules

ExperimentAggregation:
  eventTypes = started.workflow, finished.workflow
  instantiatedClass = http://www.virolab.org/onto/exp-protos/Experiment
  ecidCoherency = 1

ComputationAggregation:
  eventTypes = invoking.wfTask, invoked.wfTask
  instantiatedClass = http://www.virolab.org/onto/exp-protos/Computation
  acidCoherency = 2
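
One way to read these rules: events of the listed types that belong together are grouped, and each complete group yields one individual of the instantiated class. The sketch below implements that reading only. The event dictionaries, the `corr_id` field used for grouping and the function name `aggregate` are hypothetical, and the ecidCoherency and acidCoherency settings are left out because the slides do not explain them.

```python
from collections import defaultdict

RULES = {
    "ExperimentAggregation": {
        "eventTypes": {"started.workflow", "finished.workflow"},
        "instantiatedClass": "http://www.virolab.org/onto/exp-protos/Experiment",
    },
    "ComputationAggregation": {
        "eventTypes": {"invoking.wfTask", "invoked.wfTask"},
        "instantiatedClass": "http://www.virolab.org/onto/exp-protos/Computation",
    },
}


def aggregate(events):
    """Group events per rule and correlation id, then emit one individual
    for every group in which all event types of the rule were seen."""
    groups = defaultdict(dict)
    for event in events:
        for rule_name, rule in RULES.items():
            if event["type"] in rule["eventTypes"]:
                groups[(rule_name, event["corr_id"])][event["type"]] = event
    individuals = []
    for (rule_name, corr_id), seen in groups.items():
        rule = RULES[rule_name]
        if set(seen) == rule["eventTypes"]:
            individuals.append({"class": rule["instantiatedClass"], "id": corr_id})
    return individuals


events = [
    {"type": "started.workflow", "corr_id": "wf1"},
    {"type": "invoking.wfTask", "corr_id": "wf1/t1"},
    {"type": "invoked.wfTask", "corr_id": "wf1/t1"},
    {"type": "invoking.wfTask", "corr_id": "wf1/t2"},  # no matching "invoked" yet
    {"type": "finished.workflow", "corr_id": "wf1"},
]
individuals = aggregate(events)
```

Task wf1/t2 produces no individual here because only one of its two event types has arrived.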


Derivation rules

The simplest case – an XML element mapped directly to a functional property:

  <ext-ns:Derivation ID="OwnerLoginDeriv">
    <element>MonitoringData/experimentStarted/ownerLogin</element>
  </ext-ns:Derivation>

More complex case: which XML elements are needed and how to compute an attribute:

  <ext-ns:Derivation ID="DurationDeriv">
    <delegate>cyfronet.gs.aggregator.delegates.ExpPlugin</delegate>
    <delegateParam>software.execution.started.application/time</delegateParam>
    <delegateParam>software.execution.finished.application/time</delegateParam>
  </ext-ns:Derivation>
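
The delegate in the second rule receives two parameters, each naming an event type and a field. A plausible reading is that it subtracts the start time from the finish time. The sketch below shows that computation only; it is not the actual ExpPlugin, and the event dictionaries and both function names are hypothetical.

```python
def lookup(events, path):
    """Resolve 'event.type/field' against a list of event dicts."""
    event_type, field = path.rsplit("/", 1)
    for event in events:
        if event["type"] == event_type:
            return event[field]
    raise KeyError(path)


def duration_delegate(events, params):
    """Hypothetical delegate: finish time minus start time, in seconds."""
    start_path, finish_path = params
    return lookup(events, finish_path) - lookup(events, start_path)


# The two delegateParam values from the DurationDeriv rule.
params = [
    "software.execution.started.application/time",
    "software.execution.finished.application/time",
]
events = [
    {"type": "software.execution.started.application", "time": 100.0},
    {"type": "software.execution.finished.application", "time": 142.5},
]
duration = duration_delegate(events, params)
```

Keeping the computation in a delegate means the rule file only has to name the inputs, while the arithmetic stays in code.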


Applications

• Coordinated Traffic Management
  - Executed within K-Wf Grid infrastructure for workflows
  - Workflows with legacy backends
  - Instrumentation & tracing

• Drug Resistance application
  - Executed within ViroLab virtual laboratory for infectious diseases (virolab.cyfronet.pl)
  - Recording executions, provenance querying, visual ontology-based querying based on ExpInfo model


Conclusion

• Several monitoring challenges specific to Grid scientific workflows

• Standardized taxonomy for workflow execution events

• DHT infrastructure to improve performance of resource discovery and enable on-line monitoring

• Information model for recording workflow executions


Future Work

• Enhancement of event & information models
  - Work in progress; requires extensive review of existing systems to enhance event taxonomy, event data structures and information model

• Model enhancement & validation
  - Performance of large-scale deployment
  - Classification of workflows w.r.t. generated workloads (Preliminary study: S. Ostermann, R. Prodan, and T. Fahringer. A Trace-Based Investigation of the Characteristics of Grid Workflows)

• Information model for workflow status
  - Similar to resource status in information systems