The PHysics Analysis SERver Project (PHASER) CHEP 2000 Padova, Italy February 7-11, 2000 M. Bowen, G. Landsberg, and R. Partridge* Brown University

Richard Partridge 2

What is the PHASER project?

Effort to substantially increase productivity of physicists analyzing multi-TB summary data sets

Our immediate focus is on the DØ experiment

» 600 million data events/year starting in early 2001

» Summary data set expected to grow at rate of 3 TB/year
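As a rough cross-check, the two figures above imply an average summary-data size per event. The sketch below assumes decimal units (1 TB = 10^12 bytes), which the slides do not specify.

```python
# Rough average summary-data size per event, from the figures quoted above.
# Assumption: decimal units (1 TB = 1e12 bytes); the slides do not say.
EVENTS_PER_YEAR = 600_000_000
SUMMARY_BYTES_PER_YEAR = 3 * 10**12

bytes_per_event = SUMMARY_BYTES_PER_YEAR / EVENTS_PER_YEAR
print(f"{bytes_per_event / 1000:.1f} kB per event")
```

Under that assumption the summary data comes to about 5 kB per event.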

Concentrate on event selection and “ntuple” creation stage

» Transition in data handling from monolithic reconstruction processing to the much more chaotic processing of summary data by many physicists

» IO and CPU intensive due to need to apply latest calibration, particle ID, and event selection algorithms to several hundred million events

PHASER Architecture

Physics Object Database (POD) stores meta-data used by most physics analyses for their initial event selection

Physics Object and Particle ID tables in POD store calibrated 4-vectors, object quality variables, and results of particle ID algorithms

DVD storage of full summary (DST) data set and useful subsets of larger DST and STA data sets

PHASER is PHast

New calibrations and particle ID algorithms can be quickly incorporated

» Only the changes need to be imported

» Regenerating the large DST data set will only be done infrequently

Storage of up-to-date calibrations and particle ID algorithm results avoids the need to re-apply these algorithms for each event selection pass

Particle ID tables are small, making it possible to quickly eliminate events not having the desired set of physics objects

Direct access to full DST sample on DVD allows a DST subset to be quickly generated for advanced analyses developing new algorithms not yet in the database

The Physics Object Database (POD)

Stores fully calibrated meta-data associated with the various physics objects

» Leptons, photons, jets, missing ET, secondary vertices, triggers, etc.

» For example, an electron object would have the energy, direction, and various quantities used in the electron ID algorithms stored

Each physics object associated with a table in a relational database

Primary key uniquely identifies each physics object and provides information needed to correlate physics objects from a single event

» Currently use Run, Event, Instance (where appropriate) and row number from ntuple used to load database

» Alternative: data source index, sequence number, and instance
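The key scheme described above can be sketched with a toy schema. This is an illustration only: SQLite stands in for the production database, the table and column names are hypothetical, the inserted values are made up, and the ntuple row number is left out for brevity.

```python
import sqlite3

# Assumption: SQLite stands in for the production database, and the table and
# column names below are hypothetical, chosen only to mirror the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE electron (
    run      INTEGER NOT NULL,
    event    INTEGER NOT NULL,
    instance INTEGER NOT NULL,   -- which electron within the event
    et       REAL,               -- transverse energy (GeV)
    eta      REAL,
    phi      REAL,
    PRIMARY KEY (run, event, instance)
);
CREATE TABLE missing_et (
    run    INTEGER NOT NULL,
    event  INTEGER NOT NULL,
    met    REAL,                 -- missing transverse energy (GeV)
    PRIMARY KEY (run, event)     -- one row per event, so no instance
);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO electron VALUES (?, ?, ?, ?, ?, ?)", [
    (1001, 1, 1, 42.0, 0.3, 1.2),
    (1001, 1, 2, 12.5, -1.1, 4.0),
    (1001, 2, 1, 31.0, 0.9, 2.5),
])
conn.executemany("INSERT INTO missing_et VALUES (?, ?, ?)", [
    (1001, 1, 5.0),
    (1001, 2, 38.0),
])

# The shared (run, event) part of the key correlates objects from one event.
rows = conn.execute("""
    SELECT e.run, e.event, e.instance, e.et, m.met
    FROM electron AS e
    JOIN missing_et AS m ON m.run = e.run AND m.event = e.event
    WHERE e.et > 25 AND m.met > 25
    ORDER BY e.run, e.event, e.instance
""").fetchall()
print(rows)
```

Because every object table carries the same run and event columns, objects of different types can be joined without any central event table.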

Why use a Relational Database?

Physics objects typically have a fixed set of attributes used for event selection and analysis

Independence of tables aids loading and updating the database

» Data can be “bulk loaded” as long as primary key is provided in input data stream

Several vendors with quite capable products, large commercial market
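The bulk-loading point can be illustrated with a small sketch. It assumes SQLite as a stand-in engine, a hypothetical muon table, and an in-memory CSV playing the role of the input data stream; a production load would more likely use the database vendor's own bulk-load utility.

```python
import csv
import io
import sqlite3

# Assumptions: SQLite as a stand-in engine, a hypothetical "muon" table, and
# an in-memory CSV playing the role of the input data stream.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE muon (
        run INTEGER, event INTEGER, instance INTEGER, pt REAL,
        PRIMARY KEY (run, event, instance)
    )
""")

stream = io.StringIO(
    "run,event,instance,pt\n"
    "1001,1,1,22.4\n"
    "1001,3,1,8.9\n"
    "1001,3,2,15.1\n"
)

# Every input row already carries its own key, so the table can be filled in
# one pass without consulting any other table.
reader = csv.DictReader(stream)
with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT INTO muon VALUES (:run, :event, :instance, :pt)", reader
    )

n_loaded = conn.execute("SELECT COUNT(*) FROM muon").fetchone()[0]
print(n_loaded)
```

Since no table depends on another during the load, each object type can be loaded or reloaded on its own schedule.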

Prototype POD

Use DØ Run 1 data (1992 - 1996 running period)

62 million events loaded into the database

Entire “All-Stream” data set loaded

» Data set used by almost all DØ physics analyses

» Only files with special processing or trigger conditions excluded

Column-wise ntuple format used for importing/exporting data

DØ Run 1 POD

Object                                  Columns   Rows             Size (GB)
Electron                                28        52,540,491       6.8
Muon                                    37        79,688,956       13.2
Photon                                  22        69,278,259       7.4
Jets (3 cone sizes)                     3 x 14    472,626,080      35.7
Jets with e/γ removed (3 cone sizes)    3 x 6     67,003,537       3.1
Missing ET                              14        62,353,601       4.8
Vertex                                  6         90,004,529       4.1
Trigger                                 19        62,353,601       3.5
Event Parameters                        5         62,353,601       1.8
Totals                                  191       1,018,202,655    80.4

Including indexes, Run 1 POD occupies ~100 GB

» 58% physics object data

» 18% indexes on object ET

» 12% primary keys

» 12% database overhead

POD Benchmarks

Z → e+e- candidate event selection:

» 7 seconds to identify ~6k events

W → eν candidate event selection:

» 18 seconds to identify ~86k events

Both benchmark times make use of particle ID tables

Event selection times compare very favorably with ~1000 CPU hours required to generate ntuples used in this study

Benchmark Hardware/Software

» 450 MHz dual-processor Pentium II with 256 MB RAM

» Database stored on (6) 36 GB disks in RAID 0 stripe set

» MS SQL Server running on Windows NT 4.0
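To give a flavor of what a dielectron selection against the particle ID tables might look like, here is a minimal sketch. It is not the query used for the benchmarks above: SQLite replaces MS SQL Server, and the table names, column names, ET threshold, and data rows are all hypothetical.

```python
import sqlite3

# Assumptions: SQLite instead of MS SQL Server; hypothetical table and column
# names; made-up rows. This is not the actual benchmark query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE electron (
    run INTEGER, event INTEGER, instance INTEGER, et REAL,
    PRIMARY KEY (run, event, instance)
);
CREATE TABLE electron_id (
    run INTEGER, event INTEGER, instance INTEGER, is_good INTEGER,
    PRIMARY KEY (run, event, instance)
);
CREATE INDEX electron_et ON electron (et);  -- index on object ET
""")

electrons = [
    # run, event, instance, et, is_good
    (1, 10, 1, 45.0, 1), (1, 10, 2, 38.0, 1),   # two good, hard electrons
    (1, 11, 1, 40.0, 1), (1, 11, 2, 30.0, 0),   # second fails ID
    (1, 12, 1, 50.0, 1),                        # only one electron
    (1, 13, 1, 27.0, 1), (1, 13, 2, 12.0, 1),   # second too soft
]
conn.executemany("INSERT INTO electron VALUES (?, ?, ?, ?)",
                 [(r, e, i, et) for r, e, i, et, _ in electrons])
conn.executemany("INSERT INTO electron_id VALUES (?, ?, ?, ?)",
                 [(r, e, i, ok) for r, e, i, _, ok in electrons])

ET_CUT = 25.0  # hypothetical threshold in GeV

# Keep events with at least two electrons that pass both the ID flag and
# the ET cut.
candidates = conn.execute("""
    SELECT e.run, e.event
    FROM electron AS e
    JOIN electron_id AS pid
      ON pid.run = e.run AND pid.event = e.event
     AND pid.instance = e.instance
    WHERE pid.is_good = 1 AND e.et > ?
    GROUP BY e.run, e.event
    HAVING COUNT(*) >= 2
    ORDER BY e.run, e.event
""", (ET_CUT,)).fetchall()
print(candidates)
```

The ID decision is read from a stored flag rather than recomputed, so the selection reduces to a join, a filter, and a count per event.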

DVD Storage

Provide access to additional event information not included in POD

DVD-RAM has a number of unique capabilities

» Less expensive than disk storage, doesn’t require backup

» Access to individual events is much faster than tape storage

Current disk capacity is 2.6 GB, 4.7 GB expected soon

Commercial DVD libraries hold up to 600 DVD disks

» 2.8 TB capacity using 4.7 GB DVD-RAM disks

» Average disk load time of 4.5 s, <1 hour to cycle through 600 disks

» Up to 6 DVD-RAM drives gives ~10 MB/s IO rate
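The library figures above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 1000 GB, 1 GB = 1000 MB). The last quantity, the time to stream the whole library at the quoted aggregate rate, is derived here and does not appear on the slide.

```python
# Figures quoted on the slide; decimal units (1 TB = 1000 GB) are assumed.
DISKS = 600
GB_PER_DISK = 4.7
LOAD_SECONDS = 4.5
IO_MB_PER_S = 10.0  # aggregate rate with 6 drives

capacity_tb = DISKS * GB_PER_DISK / 1000
cycle_minutes = DISKS * LOAD_SECONDS / 60
# Derived here, not on the slide: time to read the full library once.
full_read_hours = DISKS * GB_PER_DISK * 1000 / IO_MB_PER_S / 3600

print(f"capacity: {capacity_tb:.2f} TB")
print(f"load every disk once: {cycle_minutes:.0f} min")
print(f"read everything at 10 MB/s: {full_read_hours:.0f} h")
```

The capacity (about 2.8 TB) and the 45-minute load cycle agree with the slide. Reading every byte would take roughly three days at 10 MB/s, which suggests the library is better suited to fetching selected events than to full scans.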


Web Interface

Plan to develop web-based user interface

Interface modelled on “3-tier” architecture widely used in commercial applications

Physicist will enter event selection requirements using a Java applet

Applet communicates request to “Physics Intelligence” middleware running on PHASER system (via CORBA)

» Translate request to SQL for event selection

» Verify that request can be accommodated within resource constraints

» Produce the requested output files
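The request-to-SQL translation step might look something like the sketch below. The slides do not describe the middleware API, so every name here (Cut, ALLOWED_COLUMNS, build_query, and the table and column names) is hypothetical, and the resource-constraint check and the CORBA transport are omitted.

```python
from dataclasses import dataclass

# Assumption: every name here is hypothetical; the slides do not describe
# the middleware API.
ALLOWED_COLUMNS = {
    "electron": {"et", "eta"},
    "muon": {"pt", "eta"},
}
ALLOWED_OPS = {">", ">=", "<", "<=", "="}


@dataclass(frozen=True)
class Cut:
    column: str
    op: str
    value: float


def build_query(table: str, cuts: list[Cut]) -> tuple[str, list[float]]:
    """Translate a list of cuts on one object table into parameterized SQL."""
    if table not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown table: {table}")
    clauses, params = [], []
    for cut in cuts:
        # Identifiers are checked against a whitelist; values are bound as
        # parameters, so user input never reaches the SQL text directly.
        if cut.column not in ALLOWED_COLUMNS[table] or cut.op not in ALLOWED_OPS:
            raise ValueError(f"cut not allowed: {cut}")
        clauses.append(f"{cut.column} {cut.op} ?")
        params.append(cut.value)
    where = " AND ".join(clauses) if clauses else "1"
    sql = f"SELECT DISTINCT run, event FROM {table} WHERE {where}"
    return sql, params


sql, params = build_query("electron", [Cut("et", ">", 25.0), Cut("eta", "<", 1.1)])
print(sql)
print(params)
```

Returning run and event numbers keeps the result in the shared key space, so the same list can drive either an ntuple export or a lookup in the DVD library.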


PHASER Output

Several output options:

» List of run and event numbers satisfying the request

» Ntuple created from POD information

» DST stream containing requested events from DVD library

Output files will generally be small enough to transfer over the network

Larger output files can be written to DVD and physically sent to physicist for further analysis


Conclusions

PHASER offers a way for experts, novices, and “dinosaurs” alike to quickly extract information about a particular class of events

Feasibility of loading “Run 1” size physics object info into a relational database has been demonstrated

Significant improvements in event selection time have been observed for W/Z benchmarks

Expect these results will scale up to Run 2 data load

Database technology is also potentially useful for helping manage complex analyses and storing intermediate results
