File-Metadata Management System For The LHCb Experiment Carmine Cioffi Department of Physics, University of Oxford CHEP04 Interlaken, 27 September 2004



Outline

• What are Metadata and why we need them in the LHCb experiment.

• The File-Metadata Management System
– The two schema strategy
– XML and the warehousing database
– Services and specialised views
– Relationship between the warehousing database and views
– Web Services

• ARDA and future planning


Metadata

• Generally speaking, metadata are data which characterise data-files

• The two facets of metadata
– Job provenance: Everything you ever wanted to know about how a data-file was created
– Bookkeeping: How do I identify the datasets I am interested in for my analysis?

• Metadata are needed to get straight to the files of interest, avoiding unnecessary access to the data storage.


The two schema strategy

• The two schema strategy consists of having a Database (Warehousing DB) and a View of it, both with their own schema.
– The Warehousing DataBase (WDB) is meant to store data in a simple way but be flexible enough to accept new data.
– The View is designed to be efficient for the service it is made for.
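As a minimal sketch of why a key-value warehouse can accept new data without schema changes, the following uses an in-memory SQLite database. The table names, column names and sample values are hypothetical and much simpler than the real WDB; only the name/value idea is taken from the slides.

```python
import sqlite3

# Hypothetical, heavily simplified warehouse schema: one row per job and
# one row per (job, name, value) triple. Not the real LHCb schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, config_name TEXT, job_date TEXT);
    CREATE TABLE job_params (job_id INTEGER REFERENCES jobs(job_id),
                             name TEXT, value TEXT);
""")

def add_job(conn, config_name, job_date, params):
    """Store a job together with an arbitrary set of name/value parameters."""
    cur = conn.execute(
        "INSERT INTO jobs (config_name, job_date) VALUES (?, ?)",
        (config_name, job_date),
    )
    job_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO job_params (job_id, name, value) VALUES (?, ?, ?)",
        [(job_id, name, str(value)) for name, value in params.items()],
    )
    return job_id

# Sample values are made up for illustration.
first = add_job(conn, "DC04", "2004-05-01", {"Program": "progA"})
# A parameter name never seen before needs no ALTER TABLE.
second = add_job(conn, "DC04", "2004-06-01",
                 {"Program": "progB", "Laboratory": "labX"})

names = [row[0] for row in conn.execute(
    "SELECT DISTINCT name FROM job_params ORDER BY name")]
```

The price of this flexibility is that lookups need joins over the parameter table, which is the motivation for a separate, service-specific View.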


Entity-Relationship model for WDB

[Figure: entity-relationship diagram of the WDB]


XML and the insertion of data

• Because of its key-value strategy, the WDB is liable to be corrupted:
– Data with any semantics can be inserted.
– Partial information can be inserted.

• To prevent this, the data must be presented in XML format. Using a predefined DTD or XML Schema, it is then possible to verify the correctness of the data.


The DTD for the insertion of job-related metadata

<!ELEMENT Job ((JobOption|TypedParameter|InputFile|OutputFile)*)>
<!ELEMENT JobOption EMPTY>
<!ELEMENT TypedParameter EMPTY>
<!ELEMENT InputFile EMPTY>
<!ELEMENT OutputFile ((Parameter|Quality)*)>
<!ELEMENT Parameter EMPTY>
<!ELEMENT Quality (Parameter*)>

<!ATTLIST Job ConfigName CDATA #REQUIRED
              ConfigVersion CDATA #REQUIRED
              Date CDATA #REQUIRED>
<!ATTLIST JobOption Recipient CDATA #REQUIRED
                    Name CDATA #REQUIRED
                    Value CDATA #REQUIRED>
<!ATTLIST TypedParameter Name CDATA #REQUIRED
                         Value CDATA #REQUIRED
                         Type (Info|Environment_Variable) #REQUIRED>
<!ATTLIST InputFile Name CDATA #REQUIRED>
<!ATTLIST OutputFile Name CDATA #REQUIRED
                     TypeName CDATA #REQUIRED
                     TypeVersion CDATA #REQUIRED>
<!ATTLIST Parameter Name CDATA #REQUIRED
                    Value CDATA #REQUIRED>
<!ATTLIST Quality Group CDATA #REQUIRED
                  Flag CDATA #REQUIRED>
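To illustrate what the DTD enforces, here is a small hand-written structural check in Python. It is a simplified stand-in, not DTD validation: `xml.etree.ElementTree` does not validate against a DTD, so the rules are transcribed by hand into two dictionaries, and a production system would more likely use a validating parser. The sample documents and their attribute values are made up.

```python
import xml.etree.ElementTree as ET

# Attributes (all #REQUIRED) and allowed children, transcribed from the DTD.
ATTRS = {
    "Job": {"ConfigName", "ConfigVersion", "Date"},
    "JobOption": {"Recipient", "Name", "Value"},
    "TypedParameter": {"Name", "Value", "Type"},
    "InputFile": {"Name"},
    "OutputFile": {"Name", "TypeName", "TypeVersion"},
    "Parameter": {"Name", "Value"},
    "Quality": {"Group", "Flag"},
}
CHILDREN = {
    "Job": {"JobOption", "TypedParameter", "InputFile", "OutputFile"},
    "JobOption": set(),
    "TypedParameter": set(),
    "InputFile": set(),
    "OutputFile": {"Parameter", "Quality"},
    "Parameter": set(),
    "Quality": {"Parameter"},
}
TYPE_VALUES = {"Info", "Environment_Variable"}

def check(element, errors=None):
    """Collect structural errors for an element and all its descendants."""
    errors = [] if errors is None else errors
    tag = element.tag
    if tag not in ATTRS:
        errors.append(f"unknown element <{tag}>")
        return errors
    present = set(element.attrib)
    if present != ATTRS[tag]:
        errors.append(
            f"<{tag}> has attributes {sorted(present)}, expected {sorted(ATTRS[tag])}")
    if tag == "TypedParameter" and element.get("Type", "Info") not in TYPE_VALUES:
        errors.append(f"<{tag}> has an invalid Type value")
    for child in element:
        if child.tag not in CHILDREN[tag]:
            errors.append(f"<{child.tag}> is not allowed inside <{tag}>")
        else:
            check(child, errors)
    return errors

good = ('<Job ConfigName="DC04" ConfigVersion="v1" Date="2004-09-27">'
        '<InputFile Name="in.dst"/>'
        '<OutputFile Name="out.dst" TypeName="DST" TypeVersion="1">'
        '<Parameter Name="EvtStat" Value="100"/></OutputFile></Job>')
# Partial information: two Job attributes missing, Parameter in the wrong place.
bad = ('<Job ConfigName="DC04">'
       '<OutputFile Name="out.dst" TypeName="DST" TypeVersion="1"/>'
       '<Parameter Name="x" Value="1"/></Job>')

good_errors = check(ET.fromstring(good))
bad_errors = check(ET.fromstring(bad))
```

The second document is the kind of partial record the WDB itself would accept silently; rejecting it at the XML stage keeps the key-value tables consistent.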


Services and the specialised views

• Sometimes complex SQL queries do not work well for bulk lookups.
– But the WDB contains all the information about the files, which can be used to generate specialised views for a specific service.

• Knowing the service, the views can be optimised to give the best performance.


Example of view with service and applications

• This example shows the specialised view that sits behind the XMLRPC and SERVLETS services.

• These services are used by GANGA and the Web Browser.

[Figure: the specialised view schema, a Jython web server offering the SERVLETS and XMLRPC services, and the client applications (Web Browser and GANGA application).]

SPECIALISED VIEW SCHEMA

Replica: FILE_ID, REPLICA, LOCATION

DT_JobSummary: JOB_ID, CONFIG, DBVERSION, EVENTTYPE, JOBDATE, LABORATORY, PROGRAM0, INPUTFILE0, PROGRAM1, INPUTFILE1, PROGRAM2, INPUTFILE2

DT_FileSummary: FILE_ID, JOB_ID, EVENTTYPE, EVENTDESCRIPTION, NBEVENTS, FILETYPE, FILENAME, FILESIZE


Generation of the specialised View

[Figure: an SQL script generates the Specialised View from the Warehouse DB.]

Warehouse DB tables: Jobs, JobParams, TypeParams, Files, FileParams, QualityParams
(attribute labels shown in the diagram: ConfigName, ConfigVersion, Date, Name, Value, Type, LogName)

Specialised View tables:

Replica: FILE_ID, REPLICA, LOCATION

DT_JobSummary: JOB_ID, CONFIG, DBVERSION, EVENTTYPE, JOBDATE, LABORATORY, PROGRAM0, INPUTFILE0, PROGRAM1, INPUTFILE1, PROGRAM2, INPUTFILE2

DT_FileSummary: FILE_ID, JOB_ID, EVENTTYPE, EVENTDESCRIPTION, NBEVENTS, FILETYPE, FILENAME, FILESIZE

Done periodically or on demand, based on the needs of the experiment (every night for LHCb). This is fast despite the fact that the WDB contains many GB.
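The generation step can be sketched as a script that pivots name/value rows into one flat row per file and then indexes the result. This is an illustration in SQLite, not the Oracle script used by LHCb: the source tables, the parameter names `EventType` and `NbEvents`, and the sample rows are assumptions, and only a few columns of DT_FileSummary are reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the warehouse tables (hypothetical names).
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, job_id INTEGER, filename TEXT);
    CREATE TABLE file_params (file_id INTEGER, name TEXT, value TEXT);

    INSERT INTO files VALUES (1, 10, 'a.dst'), (2, 10, 'b.dst');
    INSERT INTO file_params VALUES
        (1, 'EventType', 'typeA'), (1, 'NbEvents', '500'),
        (2, 'EventType', 'typeB'), (2, 'NbEvents', '250');
""")

# Rebuild the flat summary table from scratch; safe to run every night.
REBUILD_VIEW = """
    DROP TABLE IF EXISTS dt_filesummary;
    CREATE TABLE dt_filesummary AS
    SELECT f.file_id,
           f.job_id,
           MAX(CASE WHEN p.name = 'EventType' THEN p.value END) AS eventtype,
           CAST(MAX(CASE WHEN p.name = 'NbEvents' THEN p.value END) AS INTEGER)
               AS nbevents,
           f.filename
    FROM files f
    LEFT JOIN file_params p ON p.file_id = f.file_id
    GROUP BY f.file_id, f.job_id, f.filename;
    CREATE INDEX idx_filesummary_eventtype ON dt_filesummary (eventtype);
"""
conn.executescript(REBUILD_VIEW)

rows = conn.execute(
    "SELECT file_id, job_id, eventtype, nbevents, filename "
    "FROM dt_filesummary ORDER BY file_id"
).fetchall()
```

Because the summary table is dropped and recreated, services reading it see typed columns and a dedicated index, while the warehouse tables stay untouched.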


Some Numbers

• LHCb is using ORACLE 9i technology for its DB
– It is hosted on a cluster of two ‘Sun Fire 280R’ machines
– Each with two 750 MHz processors
– 2 GB RAM
– 600 GB HD

• The DB contains ~20 GB of data
– Shared between real data and indexing tables
– ~2M rows for jobs
– ~5.5M rows for files
– ~57M rows for parameters


LHCb services

• LHCb currently uses two services to access the information in the databases:
– Servlet service:
  • allows the selection of datasets based on their history (job provenance) from a web browser.
– XML-RPC service:
  • access to and modification of the WDB data
  • allows GANGA to access Bookkeeping data.
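A minimal sketch of how a Python client such as GANGA might talk to an XML-RPC bookkeeping service, using only the standard library. The method name `files_for_event_type`, the in-memory data and the local test server are all hypothetical; the real LHCb service interface is not described in these slides.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory stand-in for the bookkeeping view.
FILES = [
    {"filename": "a.dst", "eventtype": "typeA"},
    {"filename": "b.dst", "eventtype": "typeB"},
    {"filename": "c.dst", "eventtype": "typeA"},
]

def files_for_event_type(event_type):
    """Return the names of files with the given event type (hypothetical method)."""
    return [f["filename"] for f in FILES if f["eventtype"] == event_type]

# Port 0 lets the operating system pick a free port for this local demo.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(files_for_event_type)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy turns attribute access into remote calls.
host, port = server.server_address
client = xmlrpc.client.ServerProxy(f"http://{host}:{port}/")
result = client.files_for_event_type("typeA")

server.shutdown()
server.server_close()
```

Arguments and return values travel as XML over HTTP, so the client needs no database driver and no knowledge of the view schema.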


Collaboration with ARDA

• LHCb has started a collaboration with ARDA:
– Definition of metadata and understanding of LHCb requirements
– Elaboration of a new interface for the manipulation of file-metadata
– Possible technology (WSDL)
– See how this will fit with the existing LHCb system

• Stress-test the Bookkeeping services, analysing various behaviours:
– Different numbers of clients
– Different queries
– Comparison with direct RPC calls

• Implement the newly defined interface
– Using the current LHCb File-Metadata DB as back-end
– Using the technology developed with ARDA
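The stress test with different numbers of clients could be organised along the following lines. This is only a sketch of the measurement loop: `run_query` is a placeholder that sleeps instead of contacting a service, and the client counts and query string are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query):
    """Placeholder for one bookkeeping request; replace with a real client call."""
    time.sleep(0.001)
    return len(query)

def measure(n_clients, queries_per_client, query):
    """Run the query from n_clients concurrent workers.

    Returns (elapsed_seconds, number_of_completed_queries).
    """
    def client(_):
        return [run_query(query) for _ in range(queries_per_client)]

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        results = list(pool.map(client, range(n_clients)))
    elapsed = time.perf_counter() - start
    return elapsed, sum(len(r) for r in results)

# One measurement per client count; repeat with different queries as needed.
timings = {n: measure(n, 5, "EventType = typeA") for n in (1, 4, 16)}
```

Swapping `run_query` between the web-service client and a direct RPC call would give the comparison listed in the last sub-item of the stress-test bullet.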


CONCLUSIONS

• The two schema strategy works well for LHCb, and DC04 proved its flexibility: no changes to the WDB were required even though new data were stored.

• Because of the key-value nature of the WDB, it can easily be adapted for warehousing any data, including that of other experiments.