INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the JRA1 IT-CZ cluster




Talk Outline

• Logging and bookkeeping (L&B)
  – General overview and main features
  – Deployment
  – Expected use

• Job Provenance (JP)
  – Motivation
  – Overview and relationship with L&B
  – Expected use

• Conclusion


Logging and Bookkeeping

• Motivation
  – Keep track of Grid jobs

• General overview
  – Capture job control flow
  – Provide job state information
  – Just-in-time or short-term post-mortem analysis
  – Support user-generated events


General architecture


Features

• L&B events mark important points in the control flow of a job
  – Submission
  – Transfer between components
  – Matchmaking and brokerage results
  – Starting/finishing job execution
  – Events generated directly by the user
      Only during the actual job execution

• Events are delivered in a non-blocking but reliable way
• Job state is computed by a fault-tolerant state machine
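
The idea of computing job state from events in a fault-tolerant way can be illustrated with a short sketch. This is not the L&B implementation: the event kinds, state names and the `compute_state` function are simplified, hypothetical stand-ins. The sketch derives the state from the set of events received so far, so delayed, reordered or duplicated events do not corrupt the result.

```python
from dataclasses import dataclass

# Hypothetical, simplified event kinds mapped to the state they imply.
# The real L&B event and state sets are richer; these names are assumptions.
EVENT_TO_STATE = {
    "Submitted": "SUBMITTED",
    "Transferred": "WAITING",
    "Matched": "READY",
    "Running": "RUNNING",
    "Done": "DONE",
}

# Rank of each state along the normal job life cycle.
STATE_RANK = {
    state: rank
    for rank, state in enumerate(
        ["UNKNOWN", "SUBMITTED", "WAITING", "READY", "RUNNING", "DONE"]
    )
}


@dataclass(frozen=True)
class Event:
    job_id: str
    kind: str
    timestamp: float


def compute_state(events):
    """Return the furthest state implied by any event seen so far.

    The result depends only on which events are present, not on their
    arrival order. Unknown kinds (e.g. user events) are ignored.
    """
    state = "UNKNOWN"
    for event in events:
        implied = EVENT_TO_STATE.get(event.kind)
        if implied and STATE_RANK[implied] > STATE_RANK[state]:
            state = implied
    return state
```

In this sketch a "Running" event that overtakes the "Submitted" event still yields the RUNNING state, and the late "Submitted" event changes nothing when it finally arrives.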


User interaction

• Implicit
  – Submitting a job

• Explicit
  – Logging events during job execution
  – Querying the bookkeeping server
      Predefined set of common queries
      – Directly available through the UI

• Public API to access bookkeeping server
  – More general, for complex queries
  – User can register to receive a notification about job state

• Both reject “dangerous” queries
• Support for aggregated information about DAGs


Interaction overview


User events

• Users can store events in the bookkeeping DB
  – Non-blocking, reliable mechanism for passing job-related information

• Information is available through the L&B querying mechanism
  – Through the UI or public API

• Still asynchronous
  – Events from the same CE will usually arrive in the correct order
  – Internal and user-issued timestamps may help
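
How timestamps may help with asynchronous delivery can be sketched as follows. The `UserEvent` record and its field names are hypothetical and are not the L&B data model; the sketch only assumes that each event carries an internal timestamp and, optionally, one issued by the user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserEvent:
    name: str
    value: str
    logged_at: float                    # internal timestamp (assumed: set when the event is logged)
    user_time: Optional[float] = None   # optional timestamp issued by the user


def in_logical_order(events):
    """Sort events by the user timestamp when present, else the internal one."""
    return sorted(
        events,
        key=lambda e: e.user_time if e.user_time is not None else e.logged_at,
    )
```

A consumer such as a progress visualization could apply this ordering to query results instead of relying on arrival order.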


L&B deployment

• EGEE
  – Around 50 production installations of bookkeeping servers
  – Over 20 000 jobs per day on average
  – Over 60 GB of data since January 2005

• Other projects using EDG or EGEE middleware
  – LCG
  – CrossGrid


L&B Use

• Provision of job state
  – Including notification
  – Feed into R-GMA

• Provision of more detailed info about job flow
• Debugging
  – Transfer between components, failure trace

• Statistics (JRA2)
  – Time of submission, execution start and end
  – Matchmaking results, reasons for no match found
  – Failures

• End user events
  – E.g. visualization of progress of job execution


Job Provenance

• Motivation
  – The information about jobs has long-term value
      E.g. repeating the submission of a job executed a year ago
  – The information about job control flow and job execution environment complements job results
      E.g. to be able to reliably resubmit a job

• Job Provenance
  – Preserve information about Grid jobs
  – Allow data-mining in this information
  – Assist job re-submission


JP with WMS and L&B


JP Gathered Data

• Data from L&B
• Job inputs
  – The input sandbox
  – No copies of files in remote storage
      However, file/collection identification is available

• Execution track
  – Data (“measurements”) from CE
      Installed software versions, environment, …
  – Accounting data (DGAS)

• User annotations
• Scalability
  – Record volatile data only


Primary Data in JP

• Job is the primary entity
• Minimal set of core attributes:
  – JobID, owner, registration time

• Short data items: tags
  – “key = value” pairs

• Bulk data: uploaded files


JP Job Attributes

• A way to provide a generic, unified view of any job data
  – Multivalued
  – Format: “namespace:key = value”
  – Namespaces may have a defined schema

• User annotations are mapped directly to Job Attributes

• File-type specific plugins
  – Process bulk files

• Job Attributes are used both for internal handling and user queries
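
The “namespace:key = value” format and the multivalued nature of Job Attributes can be sketched as below. The `parse_attribute` function and the `JobAttributes` class are hypothetical illustrations, not the JP interface. The sketch assumes the key is whatever follows the last colon of the name, so that a namespace may itself contain colons.

```python
from collections import defaultdict


def parse_attribute(text):
    """Split 'namespace:key = value' into (namespace, key, value)."""
    name, sep, value = text.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {text!r}")
    # Assumption: the key follows the LAST colon of the name.
    namespace, sep, key = name.strip().rpartition(":")
    if not sep or not namespace or not key:
        raise ValueError(f"missing 'namespace:key' in {text!r}")
    return namespace.strip(), key.strip(), value.strip()


class JobAttributes:
    """Multivalued attribute store for one job (illustrative only)."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, text):
        namespace, key, value = parse_attribute(text)
        # Attributes are multivalued: append rather than overwrite.
        self._values[(namespace, key)].append(value)

    def get(self, namespace, key):
        """Return all values of the attribute, in insertion order."""
        return list(self._values.get((namespace, key), []))
```

Adding the same attribute twice keeps both values, which is what allows several user annotations under one key.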


JP Main Components

• Primary storage
  – Where the data are stored “forever”

• Index server


JP Primary Storage

• Gather and store data
• Process “bulk files” on demand to extract attributes
• Interaction with users:
  – Annotate
  – Retrieve job attributes, download files
  – Always keyed by JobID only
      Performance and scalability

• Web service control interface
• gsiftp for file transfer


JP Index Server

• To provide scalability for access
• Created and configured for a particular purpose
  – Set of Primary servers to register with
  – Conditions on jobs to retrieve
      E.g. jobs from VO A submitted after January 1st, 2006
  – List of attributes to collect

• Only a fraction of the data from Primary storage
• Incremental feed from Primary storage
  – Batch feed also available (e.g. after a crash)

• Complex user queries
  – May refer only to the attributes configured in the IS
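
The configuration items above (servers to register with, conditions on jobs, attributes to collect) can be sketched as a small in-memory model. Everything here is hypothetical: the class names, field names, job record layout and the host name used in the usage note are assumptions for illustration, not the JP Index Server interface.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IndexServerConfig:
    # Hypothetical fields mirroring the three configuration items.
    primary_servers: list      # Primary storage servers to register with
    vo: str                    # condition: only jobs of this VO
    submitted_after: datetime  # condition: only jobs submitted after this time
    attributes: list           # attributes to collect


@dataclass
class IndexServer:
    config: IndexServerConfig
    index: dict = field(default_factory=dict)

    def feed(self, job):
        """Handle one job record fed from Primary storage.

        Returns True if the job matched the conditions and was indexed.
        """
        if job["vo"] != self.config.vo:
            return False
        if job["submitted"] <= self.config.submitted_after:
            return False
        # Keep only the configured attributes: a fraction of the primary data.
        self.index[job["job_id"]] = {
            name: job["attributes"][name]
            for name in self.config.attributes
            if name in job["attributes"]
        }
        return True

    def query(self, attribute, value):
        """Return sorted IDs of indexed jobs whose attribute equals value."""
        if attribute not in self.config.attributes:
            raise KeyError(f"attribute {attribute!r} is not configured")
        return sorted(
            job_id
            for job_id, attrs in self.index.items()
            if attrs.get(attribute) == value
        )
```

For example, a server configured with `IndexServerConfig(["ps1.example.org"], "A", datetime(2006, 1, 1), ["user:experiment"])` would index only jobs from VO A submitted after that date, and would refuse a query on any attribute other than `user:experiment`.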


JP – Current Status

• Prototype implementation
  – Included in gLite 1.5
  – Limited IS configuration
  – Supported files:
      L&B and input sandbox

• Plans
  – Available from GUI
  – Complex authorization (VOMS based)
  – Support for re-submission of jobs


Conclusion

• Job-centric monitoring approach
  – Users and their jobs
  – User-specific data (annotations)
  – Infrastructure-specific information

• Logging and bookkeeping: production
  – Information gathered and provided while the job is within the Grid
  – Generic interfaces (including web service interface)
  – Security built in from the start (VOMS authorization)

• Job Provenance: prototype
  – Permanent storage of job-related information
  – Data-mining over complex job sets