
Status of DØ Computing at UTA

• Introduction
• The UTA – DØ Grid team
• DØ Monte Carlo Production
• The DØ Grid Computing
  – DØRAC
  – DØSAR
  – DØGrid Software Development Effort
• Impact on Outreach and Education
• Conclusions

DoE Site Visit, Nov. 13, 2003
Jae Yu, University of Texas at Arlington

Introduction
• UTA has been producing DØ MC events as the US leader
• UTA led the effort to
  – Start remote computing at DØ
  – Define the remote computing architecture at DØ
  – Implement the remote computing design at DØ in the US
• Leveraged its experience as the ONLY active US DØ MC farm → no longer the only one
• UTA is the leader in the US DØ Grid effort
• The UTA DØ Grid team has been playing a leadership role in monitoring-software development

The UTA-DØGrid Team
• Faculty: Jae Yu, David Levine (CSE)
• Research Associate: HyunWoo Kim
  – SAM/Grid expert
  – Development of the McFarm SAM/Grid job manager
• Software Program Consultant: Drew Meyer
  – Development, improvement, and maintenance of McFarm
• CSE Master's Degree Students:
  – Nirmal Ranganathan: Investigation of resource needs in Grid execution
• EE M.S. Student: Prashant Bhamidipati
  – MC farm operation and McPerM development
• PHY Undergraduate Student: David Jenkins
  – Taking over MC farm operation and development of the monitoring database
• Graduated:
  – Three CSE MS students → all are in industry
  – One CSE undergraduate student, now in the MS program at U. of Washington

UTA DØ MC Production
• Two independent farms
  – Swift farm (HEP): 36 P3 866 MHz CPUs, 250 Mbyte/CPU, a total of 0.6 TB disk space
  – CSE farm: 12 P3 866 MHz CPUs
• McFarm is our production control software
• Statistics (11/1/2002 – 11/12/2003):
  – Produced: ~10M events
  – Delivered: ~8M events
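For rough scale only (a back-of-the-envelope using just the numbers on this slide, and assuming production was spread evenly over the ~12.5-month window and the 48 CPUs of the two farms):

\[
\frac{10\times10^{6}\ \text{events}}{12.5\ \text{months}} \approx 8\times10^{5}\ \frac{\text{events}}{\text{month}},
\qquad
\frac{8\times10^{5}}{48\ \text{CPUs}} \approx 1.7\times10^{4}\ \frac{\text{events}}{\text{CPU}\cdot\text{month}}.
\]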

What do we want to do with the data?

We want to analyze data no matter where we are!!!
→ Location- and time-independent analysis

DØ Data Taking Summary

[Data-taking plot; currently ~30–40M events/month]

What do we need for efficient data analyses in a HEP experiment?
• Total expected data size is ~4 PB (4 million GB = 100 km of 100 GB hard drives)!!!
• Detectors are complicated → need many people to construct them and make them work
• The collaboration is large and scattered all over the world
• Allow software development at remote institutions
• Optimized resource management, job scheduling, and monitoring tools
• Efficient and transparent data delivery and sharing

DØ Collaboration

650 collaborators, 78 institutions, 18 countries

Old Deployment Models

Started with the Fermilab-centric SAM infrastructure in place, …
…then transitioned to a hierarchically distributed model.

DØ Remote Analysis Model (DØRAM)

[Diagram: a tiered hierarchy with the Central Analysis Center (CAC) at the top, Regional Analysis Centers (RAC) below it, Institutional Analysis Centers (IAC) under each RAC, and Desktop Analysis Stations (DAS) at the bottom; both normal and occasional interaction/communication paths are indicated]
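A minimal sketch of the tier structure above, purely for illustration: the class, method, site, and dataset names below are invented here, and the real DØRAM data-location machinery is SAM, not this toy lookup.

```python
# Toy model of the DØRAM tiers: a Desktop Analysis Station (DAS) asks its
# Institutional Analysis Center (IAC) first, and only walks up to the
# Regional (RAC) and Central (CAC) tiers when a lower tier cannot serve it.
class Site:
    def __init__(self, name, parent=None):
        self.name = name          # e.g. "IAC", "RAC", "CAC"
        self.parent = parent      # next tier up; None for the CAC
        self.datasets = set()     # datasets this tier happens to hold

    def locate(self, dataset):
        """Return the nearest tier (walking upward) that holds the dataset."""
        site = self
        while site is not None:
            if dataset in site.datasets:
                return site
            site = site.parent    # occasional interaction with a higher tier
        return None

# Illustrative topology and dataset placement (not the real DØSAR layout)
cac = Site("CAC (Fermilab)")
rac = Site("RAC (UTA)", parent=cac)
iac = Site("IAC", parent=rac)
das = Site("DAS (desktop)", parent=iac)

cac.datasets.add("raw")          # only the central facility keeps the raw data
rac.datasets.add("thumbnail")    # a regional center caches a smaller format

print(das.locate("thumbnail").name)   # -> RAC (UTA)
print(das.locate("raw").name)         # -> CAC (Fermilab)
```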

What is a DØRAC?
• A large, concentrated computing resource hub
• An institute willing to provide storage and computing services to a few small institutes in the region
• An institute capable of providing increased infrastructure as the data from the experiment grows
• An institute willing to provide support personnel
• Complementary to the central facility

DØ Southern Analysis Region (DØSAR)

The first US region, centered around the UTA RAC.

[Map of the region's sites: UTA, OU/LU, LTU, Rice, KU, KSU, Ole Miss, UAZ, and partners in Mexico and Brazil]

It is a regional virtual organization (RVO) within the greater DØ VO!!

SAR Institutions
• First-generation IACs
  – Langston University
  – Louisiana Tech University
  – University of Oklahoma
  – UTA
• Second-generation IACs
  – Cinvestav, Mexico
  – Universidade Estadual Paulista, Brazil
  – University of Kansas
  – Kansas State University
• Third-generation IACs
  – Ole Miss, MS
  – Rice University, TX
  – University of Arizona, Tucson, AZ

Goals of the DØ Southern Analysis Region
• Prepare institutions within the region for grid-enabled analyses using the RAC at UTA
• Enable IACs to contribute to the experiment as much as they can, including MC production and data re-processing
• Provide grid-enabled software and computing resources to the DØ collaboration
• Provide regional technical support and help new IACs
• Perform physics data analyses within the region
• Discover and draw in more computing and human resources from external sources

SAR Workshops
• Semi-annual workshops to promote healthy regional collaboration and to share expertise
• Two workshops held so far
  – April 18–19, 2003 at UTA: ~40 participants
  – Sept. 25–26, 2003 at OU: 32 participants
• Each workshop had different goals and outcomes
  – Established SAR, RAC & IAC web pages and e-mail
  – Identified institutional representatives
  – Enabled three additional IACs with MC production
  – Paired new institutions with existing ones

SAR Strategy
• Set up all IACs with the full DØ software installation (DØRACE Phase 0 – IV)
• Install the Condor (or PBS) batch control system on desktop farms or clusters
• Install the McFarm MC production control software
• Produce MC events on IAC machines
• Install Globus for monitoring information transfer
• Install SAM-Grid and interface McFarm to it
• Submit jobs through SAM-Grid and monitor them (see the sketch below)
• Perform analysis at the individual's desk
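As an illustration of the batch-submission step only: this is not McFarm code — the wrapper function, executable name, and arguments below are invented for the sketch, and only the condor_submit command itself is standard Condor (PBS sites would call qsub instead).

```python
# Hypothetical glue of the kind a farm-control layer could use to hand one
# MC job to a local Condor pool.
import os
import subprocess
import tempfile

def submit_mc_job(executable, args, log_dir="/tmp"):
    """Write a minimal Condor submit description file and run condor_submit."""
    submit_text = f"""\
universe   = vanilla
executable = {executable}
arguments  = {' '.join(args)}
output     = {log_dir}/job.out
error      = {log_dir}/job.err
log        = {log_dir}/job.log
queue
"""
    with tempfile.NamedTemporaryFile("w", suffix=".submit", delete=False) as f:
        f.write(submit_text)
        submit_file = f.name
    try:
        subprocess.check_call(["condor_submit", submit_file])
    finally:
        os.remove(submit_file)

# Example (placeholder executable and request number):
# submit_mc_job("/usr/local/bin/run_mc_chain", ["--request", "12345"])
```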

SAR Software Status
• Up to date with DØ releases
• McFarm MC production control
• Condor or PBS as batch control
• Globus v2.x for grid-enabled communication
  – Globus & DOE SG certificates obtained and installed
• SAM-Grid on two of the farms (the UTA IAC farms)

UTA Software for SAR
• McFarm job control
  – All DØSAR institutions use this product for automated MC production
• Ganglia resource monitoring
  – Covers 7 clusters (332 CPUs), including the Tata Institute, India
• McFarmGraph: MC job status monitoring system using GridFTP (see the sketch below)
  – Provides detailed information for an MC request
• McPerM: MC farm performance monitoring
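A rough sketch of the status-collection idea only: the host names, the remote file path, and the one-line-per-request status format below are all invented for illustration, and this is not McFarmGraph itself. Only the globus-url-copy client call is standard Globus Toolkit 2.x GridFTP usage.

```python
# Pull a per-farm status file over GridFTP and parse it into job records.
import os
import subprocess

FARMS = ["hepfm000.uta.edu", "ouhep00.nhn.ou.edu"]   # placeholder host names
REMOTE_STATUS = "/scratch/mcfarm/status.txt"          # placeholder path and format

def fetch_status(host, local_file="status.txt"):
    """Copy one farm's status file with the standard GridFTP client."""
    subprocess.check_call([
        "globus-url-copy",
        f"gsiftp://{host}{REMOTE_STATUS}",
        f"file://{os.path.abspath(local_file)}",
    ])
    # Assume one "request_id state n_events" triple per line (invented format)
    jobs = []
    with open(local_file) as f:
        for line in f:
            request_id, state, n_events = line.split()
            jobs.append({"request": request_id, "state": state, "events": int(n_events)})
    return jobs

if __name__ == "__main__":
    for farm in FARMS:
        for job in fetch_status(farm):
            print(farm, job["request"], job["state"], job["events"])
```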

Ganglia Grid Resource Monitoring

[Screenshot of the DØSAR Ganglia monitoring page; an annotation marks the 1st SAR workshop]
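For reference, a minimal sketch (not the SAR scripts) of how cluster totals like those on the Ganglia page can be collected: each gmond daemon streams an XML description of its cluster to anyone who connects to its TCP port (8649 by default). The host name below is a placeholder.

```python
# Read a Ganglia gmond XML dump and count hosts and CPUs per cluster.
import socket
import xml.etree.ElementTree as ET

def read_gmond(host, port=8649):
    """Connect to gmond and return the XML it streams on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as s:
        while True:
            data = s.recv(8192)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def summarize(xml_bytes):
    root = ET.fromstring(xml_bytes)
    for cluster in root.iter("CLUSTER"):
        hosts = cluster.findall("HOST")
        cpus = 0
        for h in hosts:
            for m in h.findall("METRIC"):
                if m.get("NAME") == "cpu_num":   # standard Ganglia metric
                    cpus += int(float(m.get("VAL")))
        print(cluster.get("NAME"), len(hosts), "hosts,", cpus, "CPUs")

# summarize(read_gmond("ganglia.example.edu"))   # placeholder host
```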

Job Status Monitoring: McFarmGraph

[Screenshot of McFarmGraph job-status pages for MC requests]

Farm Performance Monitor: McPerM

[McPerM performance plots for the UTA farms; the annotation marks a period of increased productivity]

UTA RAC and Its Status
• NSF MRI funded facility
  – Joint proposal of UTA HEP and CSE + UTSW Medical
  – 2 HEP, 10 CSE, and 2 UTSW Medical
• Core system (high-throughput research system)
  – CPU: 64 P4 Xeon 2.4 GHz (total ~154 GHz)
  – Memory & NIC: 1 GB/CPU & one 1 Gbit/sec port each (total of 64 GB)
  – Storage: 5 TB Fiber Channel supported by 3 GFS servers (3 Gbit/sec throughput)
  – Network: Foundry switch w/ 52 Gbit/sec + 24 100 Mbit/sec ports
• Expansion system (high CPU cycle, large-storage grid system)
  – CPU: 100 P4 Xeon 2.6 GHz (total ~260 GHz)
  – Memory & NIC: 1 GB/CPU & one 1 Gbit/sec port each (total of 100 GB)
  – Storage: 60 TB IDE RAID supported by 10 NFS servers
  – Network: 52 Gbit/sec
• The full facility went online on Oct. 31, 2003
• Software installation in progress
• Plan to participate in the SC2003 demo next week

Just to Recall Two Years Ago….

[Diagram: a disk server and a Gbit switch in front of a stack of IDE-RAID arrays]

• IDE hard drives are ~$2.5/GByte
• Each IDE-RAID array set gives ~1.6 TByte, hot swappable
• Can be configured with up to 10–16 TB in a rack
• A modest server can manage the entire system
• A Gbit network switch provides high-throughput transfer to the outside world
• Flexible and scalable system
• Need an efficient monitoring and error recovery system
• Communication to resource management

UTA DØRAC
• 100 P4 Xeon 2.6 GHz CPUs = 260 GHz; 64 TB of disk space
• 84 P4 Xeon 2.4 GHz CPUs = 202 GHz; 7.5 TB of disk space
• Total CPU: 462 GHz
• Total disk: 73 TB
• Total memory: 168 GB
• Network bandwidth: 54 Gbit/sec

SAR Accomplishments
• Held two workshops; a third is planned
• All first-generation institutions produce MC events using McFarm on desktop PC farms
  – Generated MC events: OU 300k, LU 250k, LTU 150k, UTA ~1.3M
  – Discovered additional resources
• Significant local expertise has been accumulated in running farms and producing MC events
• Produced several documents, including two DØ notes
• Hold regular bi-weekly meetings (VRVS) to keep up progress
• Working toward data re-processing

SAR Computing Resources

Institution    CPU (GHz)         Storage (TB)        People
Cinvestav      13                1.1                 1F + ?
Langston       13                1                   1F + 1GA
LTU            25+12             0.5 + 0.5           1F + 1PD + 2GA
KU             12                ??                  1F + 1PD(?)
KSU            40                1.2                 1F + 2GA
OU             36+27 (OSCER)     1.8 + 120 (tape)    4F + 3PD + 2GA
Sao Paulo      60+144 (future)   3                   1F + many
UTA            192               31                  2F + 1.4PD + 0.5C + 3GA
Total          430               40 + 120 (tape)     12F + 6PD + 10GA

(F, PD, GA, and C presumably denote faculty, postdoc, graduate assistant, and consultant FTEs.)

SAR Plans
• Four second-generation IACs have been paired with four first-generation institutions
  – Success is defined as:
    • Regular production and delivery of MC events to SAM using McFarm
    • Installing SAM-Grid and performing a simple SAM job
  – Add all these new IACs to Ganglia, McFarmGraph, and McPerM
• Discover and integrate more resources for DØ
  – Integrate OU's OSCER cluster
  – Integrate other institutions' large, university-wide resources
• Move toward grid-enabled regional physics analyses
  – Collaborators need to be educated to use the system

Future Software Projects
• Preparation of the UTA DØRAC equipment
  – MC production (DØ is suffering from a shortage of resources)
  – Re-reconstruction
  – SAM-Grid
• McFarm
  – Integration of re-processing
  – Enhanced monitoring
  – Better error handling
• McFarm interface to SAM-Grid (job_manager)
  – Initial script successfully tested for the SC2003 demo
• Work with the SAM-Grid team on the monitoring database and integration of McFarm technology
• Improvement and maintenance of McFarmGraph and McPerM
• Universal graphical user interface to the Grid (PHY PhD student)

SAR Physics Interests
• OU/LU:
  – EWSB/Higgs searches
  – Single top search
  – CPV / rare decays in heavy flavors
  – SUSY
• LTU:
  – Higgs search
  – B-tagging
• UTA:
  – SUSY
  – Higgs searches
  – Diffractive physics
• Diverse topics, but common samples can be defined

Funding at SAR
• Hardware support
  – UTA – RAC: NSF MRI
  – UTA – IAC: DoE + local funds
    • Totally independent of RAC resources
    • Need more hardware to adequately support desktop analyses utilizing RAC resources
• Software support
  – Mostly UTA local funding → will run out this year!!!
  – Many attempts at other sources, but none has worked
• We seriously need help to
  – Maintain the leadership in DØ Remote Computing
  – Maintain the leadership in grid computing
  – Realize the DØRAM and expeditious physics analyses

Tevatron Grid Framework: SAM-Grid
• DØ already has the data-delivery part of the Grid system (SAM)
• The project started in 2001 as part of the PPDG collaboration to handle DØ's expanded needs
• The current SAM-Grid team includes:
  – Andrew Baranovski, Gabriele Garzoglio, Lee Lueking, Dane Skow, Igor Terekhov, Rod Walker (Imperial College), Jae Yu (UTA), Drew Meyer (UTA), and HyunWoo Kim (UTA), in collaboration with the U. Wisconsin Condor team
  – http://www-d0.fnal.gov/computing/grid
• UTA is working on developing an interface for McFarm to SAM-Grid (see the skeleton sketch below)
• This brings all the SAR institutions, plus any institution running McFarm, into the DØGrid
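To make concrete what such an interface has to cover, here is a purely illustrative skeleton of the three operations a McFarm job-manager adapter needs (submit / poll / cancel). This is not the actual SAM-Grid or Globus GRAM job-manager interface; the class name, method names, state strings, and the mcfarm_* commands are invented for the sketch.

```python
# Map generic grid job requests onto local McFarm-style farm operations.
import subprocess

class McFarmAdapter:
    def submit(self, request_id, n_events):
        """Hand an MC request to the local farm-control layer and return a handle."""
        subprocess.check_call(["mcfarm_submit", request_id, str(n_events)])
        return request_id                      # reuse the request id as the job handle

    def poll(self, handle):
        """Translate the local farm state into a coarse grid-level state."""
        out = subprocess.check_output(["mcfarm_status", handle], text=True).strip()
        return {"running": "ACTIVE", "done": "DONE"}.get(out, "FAILED")

    def cancel(self, handle):
        """Abort the request on the farm."""
        subprocess.check_call(["mcfarm_cancel", handle])
```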

Fermilab Grid Framework (SAM-Grid)

[SAM-Grid architecture diagram; UTA is marked among the participating sites]

UTA–FNAL CSE Master's Student Exchange Program
• In order to establish usable Grid software on the DØ time scale, the project needs highly skilled software developers
  – FNAL cannot afford computer professionals
  – The UTA CSE department has 450 MS students → many are highly trained but back at school due to the economy
  – Students can participate in cutting-edge Grid computing topics in a real-life situation
  – A student's Master's thesis becomes a well-documented record of the work, which many HEP computing projects lack
• The third generation of students is at FNAL working on improvement of SAM-Grid and its implementation → two-semester rotation period
• The previous two generations have made a significant impact on SAM-Grid
  – One of the four previous-generation students is in the PhD program at CSE
  – One is on the Wisconsin Condor team → possibility of getting into a PhD program
  – Two are in industry

Impact on Education and Outreach
• The UTA DØ Grid program
  – Trained: 12 (10 MS + 1 undergraduate) students
  – Graduated: 5 CSE Masters + 1 undergraduate
  – CSE Grid course: many class projects on DØ
• QuarkNet
  – UTA is one of the founding institutions of the QuarkNet program
  – Initiated the TECOS project
  – Other school-top cosmic projects across the nation need storage and computing resources → QuarkNet Grid
  – Will be working with QuarkNet on data storage and the eventual use of computing resources by teachers and students
• UTA recently became a member of the Texas grid (HiPCAT)
  – HEP is leading this effort
  – Strongly supported by the university
  – Expect a significant increase in infrastructure, such as bandwidth

Conclusions
• The UTA DØ Grid team has accomplished a tremendous amount
• UTA has played a leading role in DØ Remote Computing
  – MC production
  – Design of the DØ Grid architecture
  – Implementation of the DØRAM
• The DØ Southern Analysis Region is a great success
  – Four new institutions (3 US) are now MC production sites
  – Enabled exploitation of available intelligence and resources in an extremely distributed environment
  – Remote expertise is being accumulated

• The UTA DØRAC is up and running → software installation in progress
  – Soon to add significant resources to SAR and to DØ
• The SAM-Grid interface to McFarm is working → one step closer to establishing a globalized grid
• The UTA – FNAL MS student exchange program is very successful
• The UTA DØ Grid computing program has a significant impact on outreach and education
• UTA is the ONLY US DØ institution that has been playing a leading role in the DØ grid → this makes UTA unique
• The local support runs out this year!! UTA needs support to maintain its leadership in, and support for, DØ Remote Computing