David Adams ATLAS DIAL: Distributed Interactive Analysis of Large datasets David Adams BNL August 5, 2002 BNL OMEGA talk

David Adams

ATLAS

DIAL: Distributed Interactive Analysis of Large datasets

David Adams

BNL

August 5, 2002

BNL OMEGA talk

Contents
• Definitions
• Use cases
• Requirements
• Design
• Datasets
• Dataset interface
• Dataset implementation
• Status and conclusions

Definitions

Dataset
• Collection of event data
  – Known event (beam crossing) IDs
  – Same content (raw, reconstructed, summary, …) for each event
  – Known luminosity and selection criteria (including triggers)
• Suitable for extracting physical quantities (cross section, limit, etc.)
  – Or special data for calibration, alignment or monitoring detector performance

Definitions (cont)

Large
• Too big to analyze from a single process
  – Today: 100 GB or more

Analysis
• Loop over events and perform the same action on each
  – Select events
  – Visualize events
  – Fill histograms and tuples
  – Generate new event data?

Definitions (cont)

Interactive
• Rapid response
  – Request processed in seconds, not hours
• Updates if the request is not finished quickly:
  – Partial results
  – Progress meter
    > % completed
    > Time to completion
  – Status visualization: what is being processed where
  – Able to terminate incomplete requests
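
The progress meter items above (% completed and time to completion) could be computed roughly as follows. This is a minimal Python sketch, not DIAL code: the function name `progress_report` and the assumption of a roughly constant processing rate are illustrative only.

```python
def progress_report(events_done, events_total, elapsed_seconds):
    """Return (percent completed, estimated seconds remaining).

    Assumes events are processed at a roughly constant rate. This is an
    illustrative simplification, not something specified in the talk.
    """
    if events_total <= 0:
        raise ValueError("events_total must be positive")
    percent = 100.0 * events_done / events_total
    if events_done == 0 or elapsed_seconds <= 0:
        # No rate information yet, so no time estimate is possible.
        return percent, None
    rate = events_done / elapsed_seconds  # events per second so far
    remaining = (events_total - events_done) / rate
    return percent, remaining
```

A client could call this each time it polls for status, for example once per update interval.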

Definitions (cont)

Distributed
• Central process presents results to the user
• Processing carried out by multiple jobs
• Jobs on different machines and different sites
• Motivation:
  – Access remote data
  – Parallel processing for faster response

Use cases

Event data specification
• User defines dataset
  – Which events and which data in each event
• Includes version of data for each event
  – E.g. jets from reco version 14.2 instead of 13.1
• Restrict visible content of each event
  – E.g. jets, not tracks
  – Reduces cost of data access
• Dataset used as input for processing
• Dataset can be recorded and recalled later

Use cases (cont)

Event loop processing
• Event selection
  – User provides algorithm to be run on each event
  – Result determines if event is included in output dataset
• Fill histogram
  – User defines histogram and provides algorithm to fill from data for one event
• Fill tuple
  – Collection of named variables
  – User provides algorithm to fill 0-N times/event
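
The three event loop use cases above can be pictured with a small Python sketch. Everything here is hypothetical and chosen for illustration: the `Histogram` class, the `run_event_loop` function, the dictionary event layout and the `jet_pt` variable are not part of DIAL. The sketch also assumes that histograms and tuples are filled for every event, not only for selected ones, which the talk does not specify.

```python
class Histogram:
    """Fixed-width 1D histogram; values outside [low, high) are ignored."""

    def __init__(self, nbins, low, high):
        self.nbins, self.low, self.high = nbins, low, high
        self.counts = [0] * nbins

    def fill(self, value):
        if self.low <= value < self.high:
            width = (self.high - self.low) / self.nbins
            index = min(int((value - self.low) / width), self.nbins - 1)
            self.counts[index] += 1


def run_event_loop(events, select, fill_histogram, fill_tuple):
    """Apply the same user-supplied algorithms to every event."""
    selected_ids = []  # event IDs that would form the output dataset
    rows = []          # tuple: rows of named variables
    for event in events:
        if select(event):
            selected_ids.append(event["id"])
        fill_histogram(event)
        rows.extend(fill_tuple(event))  # zero or more rows per event
    return selected_ids, rows


# Hypothetical events: an ID plus a list of jet transverse momenta.
events = [
    {"id": 1, "jet_pt": [55.0, 32.0]},
    {"id": 2, "jet_pt": []},
    {"id": 3, "jet_pt": [120.0]},
]
hist = Histogram(nbins=4, low=0.0, high=200.0)


def has_jets(event):
    return len(event["jet_pt"]) > 0


def fill_pt_histogram(event):
    for pt in event["jet_pt"]:
        hist.fill(pt)


def jet_rows(event):
    return [{"id": event["id"], "pt": pt} for pt in event["jet_pt"]]


selected, rows = run_event_loop(events, has_jets, fill_pt_histogram, jet_rows)
```

The user supplies only the three small functions; the loop itself is the same for every request, which is what makes it possible to run it elsewhere.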

Use cases (cont)

Single event processing
• Fetch event
  – Data for selected event returned to user
  – User may request a subset of the event data
• Visualization
  – User defines a “view”
  – User specifies an event and the associated data is used to fill the view

Use cases (cont)

Distributed processing
• Remote processing
  – Analysis program runs on the local node
  – Data is located on a remote node
  – Job processing data runs on the remote node
  – User generates requests on the local node; these are run on the remote node and the results are returned to the local node
• Parallel processing
  – Dataset divided by event and each sub-dataset is processed in a separate process or thread

Use cases (cont)

Distributed processing (cont)
• Multi-node processing
  – Previous processes are run on different compute nodes
• Multi-site processing
  – Previous processes are distributed over different sites
• GRID processing
  – Previous case uses GRID for job specification, submission, authentication and monitoring

Requirements

Use cases
• Satisfy the preceding use cases

Interactivity
• Show status while a request is being processed
• Update status once/minute (adjustable)
• Return partial results on the same time scale
• Provide facility to abort a request

Requirements (cont)

History
• Event selection
  – Identify and record the attributes (including code) for each event selection algorithm
• Dataset
  – Identify and record each dataset
  – Provide mechanism to recover the selection algorithm(s) used to construct a dataset

Design

Dataset
• This description of a set of event data is the basis for all analysis
  – More on this later

Analyzer
• User works in an analysis framework which provides the tools required to view and process histograms and tuples
• ROOT is one example

Design (cont)

Task
• Specifies the operation to perform on each event including
  – Number of event selections to be performed
  – Histograms to be filled
  – Tuple to be filled
  – Code which makes selections and fills histograms and tuples

Design (cont)

Application
• Description of the executable run by jobs
• Loops over events in a dataset
• Executes task on each to generate event result
• Merges successful event results to form a dataset result
• Specification includes
  – Application name
    > E.g. Athena or ROOT
  – Version or acceptable versions

Design (cont)

Event result
• Flag indicating whether event was accepted for each event selection entry
• Histogram entries for each fill
• Tuple values for each fill
• Return status from task
  – Success or failure

Design (cont)

Dataset result
• New dataset for each event selection
  – Old dataset plus list of IDs for each event selection
• Filled histograms
• Filled tuples
• List of events for which task processing was unsuccessful
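
One way to picture how event results combine into a dataset result is the Python sketch below. The class and field names (`EventResult`, `DatasetResult`, `merge_event_results`) are assumptions made for illustration, and histograms are represented simply as lists of filled values. Following the design, only successful event results are merged; failed events are recorded by ID.

```python
class EventResult:
    """Outcome of running a task on one event (hypothetical layout)."""

    def __init__(self, event_id, accepted, hist_fills, tuple_rows, success=True):
        self.event_id = event_id
        self.accepted = accepted      # one flag per event selection
        self.hist_fills = hist_fills  # histogram name -> values filled
        self.tuple_rows = tuple_rows  # rows of named variables
        self.success = success        # return status from the task


class DatasetResult:
    """Merged outcome for a whole dataset (hypothetical layout)."""

    def __init__(self, n_selections):
        self.selected_ids = [[] for _ in range(n_selections)]
        self.hist_values = {}
        self.tuple_rows = []
        self.failed_ids = []


def merge_event_results(event_results, n_selections):
    result = DatasetResult(n_selections)
    for er in event_results:
        if not er.success:
            # Unsuccessful events contribute only to the failure list.
            result.failed_ids.append(er.event_id)
            continue
        for index, flag in enumerate(er.accepted):
            if flag:
                result.selected_ids[index].append(er.event_id)
        for name, values in er.hist_fills.items():
            result.hist_values.setdefault(name, []).extend(values)
        result.tuple_rows.extend(er.tuple_rows)
    return result
```

Each list in `selected_ids`, combined with the old dataset, corresponds to the new dataset for one event selection.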

Design (cont)

Job scheduler
• Receives request (application, task and dataset) from analyzer
• May divide dataset into sub-datasets
• Creates or locates jobs with a matching application (and possibly task)
• Adds task to jobs if needed
• Passes a dataset to each job, invokes task and receives result
• Merges results and returns to analyzer
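
The scheduler steps above (split, run, merge) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: local threads stand in for jobs on remote nodes, a dataset is reduced to a list of event IDs, matching jobs to an application is omitted, and the names `split_dataset`, `run_job` and `schedule` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_dataset(event_ids, n_parts):
    """Divide a list of event IDs into at most n_parts contiguous sub-datasets."""
    if not event_ids:
        return []
    size = -(-len(event_ids) // n_parts)  # ceiling division
    return [event_ids[i:i + size] for i in range(0, len(event_ids), size)]


def run_job(task, sub_dataset):
    """Stand-in for a job: invoke the task on each event of a sub-dataset."""
    return [task(event_id) for event_id in sub_dataset]


def schedule(task, event_ids, n_jobs=2):
    """Split the dataset, run one job per sub-dataset, merge the results."""
    sub_datasets = split_dataset(event_ids, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() returns partial results in sub-dataset order.
        partial_results = list(pool.map(lambda ds: run_job(task, ds), sub_datasets))
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

Because the split is by event and results are merged in order, the analyzer sees the same result it would get from a single job.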

Design (cont)

[Figure: interaction diagram. Components: Analyzer, Application, Task, Dataset 1, Dataset 2, Scheduler, Job 1, Job 2. Labeled steps: 1. create; 2. create; 3. create; 4. create; 5. submit(app,tsk,ds); 6. splitDataset; 6. create; 7. create(app,tsk) (twice); 8. submit(tsk,ds1); 8. submit(tsk,ds2).]

Datasets

Datasets provide interface and means for accessing event data
• Different types
  – Raw
  – Reconstructed
  – Summary
  – Tag
• Organized into EDOs (event data objects)
  – Dataset does not see inside EDO
• Following plots give some examples

Datasets (cont)

[Figure: two event views, each showing Raw, Track Clusters, Found Tracks, Refit Tracks, Electrons and EM Clusters. Labels: reco 1, reco 2. Caption: Two complete event views with the same content.]

Datasets (cont)

[Figure: two event views, each showing Raw, Track Clusters, Found Tracks, Refit Tracks, Electrons and EM Clusters. Label: absent. Caption: Two incomplete and consistent event views with the same content.]

Datasets (cont)

[Figure: two event views, each showing Raw, Track Clusters, Found Tracks, Refit Tracks, Electrons and EM Clusters, plus a second Refit Tracks and Electrons. Captions: Ambiguous event view. Inconsistent event view. Annotations: Not allowed; Not allowed?]

Dataset interface

Event range
• Collection of event IDs

Content
• Collection of content IDs

Event data (event views)
• For each event ID-content ID pair:
  – A means to access the corresponding EDO or
  – A flag indicating the EDO is not included
• No other event data is included
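
The interface above can be illustrated with a short Python sketch. It is not the DIAL dataset class: the names `Dataset`, `add_edo`, `get_edo` and the `ABSENT` flag are hypothetical, and "a means to access the EDO" is modeled as a zero-argument loader function so that data is fetched only on request.

```python
class Dataset:
    """Event IDs x content IDs; each pair maps to an EDO loader or is absent."""

    ABSENT = object()  # flag: the EDO is not included in this dataset

    def __init__(self, event_ids, content_ids):
        self.event_ids = list(event_ids)      # event range
        self.content_ids = list(content_ids)  # content
        self._loaders = {}  # (event ID, content ID) -> zero-argument loader

    def _check(self, event_id, content_id):
        # No event data outside the declared range and content is included.
        if event_id not in self.event_ids or content_id not in self.content_ids:
            raise KeyError((event_id, content_id))

    def add_edo(self, event_id, content_id, loader):
        self._check(event_id, content_id)
        self._loaders[(event_id, content_id)] = loader

    def get_edo(self, event_id, content_id):
        self._check(event_id, content_id)
        loader = self._loaders.get((event_id, content_id))
        return Dataset.ABSENT if loader is None else loader()


# Hypothetical usage: jets are available for event 1 only.
ds = Dataset([1, 2], ["jets", "tracks"])
ds.add_edo(1, "jets", lambda: ["jet-a", "jet-b"])
```

The dataset never inspects what the loader returns, in keeping with the statement that a dataset does not see inside an EDO.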

Dataset interface (cont)

[Figure: labels Event ID, Version (code, params), Content (type-key, PC, stream), Event list, 5 Files, 1 Dataset. Caption: Example of a dataset and its mapping to data files.]

Dataset implementation

Datasets are used in many ways
• Inspection by humans
• I/O for processing in C++
  – And other languages
• Cataloging in DBs

Implementation
• Prefer something object oriented
• At present, C++ classes with XML persistence
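
To show why XML persistence suits the uses listed above (human inspection, I/O from several languages, cataloging), here is a minimal Python sketch of a round trip. The actual implementation is in C++ and its XML schema is not given in the talk; the element names `dataset`, `events`, `event`, `content` and `item` are invented for illustration.

```python
import xml.etree.ElementTree as ET


def dataset_to_xml(name, event_ids, content_ids):
    """Serialize a dataset description to an XML string (invented schema)."""
    root = ET.Element("dataset", {"name": name})
    events = ET.SubElement(root, "events")
    for event_id in event_ids:
        ET.SubElement(events, "event", {"id": str(event_id)})
    content = ET.SubElement(root, "content")
    for content_id in content_ids:
        ET.SubElement(content, "item", {"id": content_id})
    return ET.tostring(root, encoding="unicode")


def dataset_from_xml(text):
    """Rebuild (name, event IDs, content IDs) from the XML string."""
    root = ET.fromstring(text)
    event_ids = [int(e.get("id")) for e in root.find("events")]
    content_ids = [c.get("id") for c in root.find("content")]
    return root.get("name"), event_ids, content_ids
```

The resulting text can be read by a person, stored in a catalog, or parsed from another language without sharing any object code.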

Status and conclusions

High-level design for DIAL is in place
• Described in this talk
• See http://www.usatlas.bnl.gov/~dladams/dial

Detailed design and first implementation of datasets is finished
• See http://www.usatlas.bnl.gov/~dladams/dataset