Federico Carminati, Peter Hristov, NEC’2011, Varna, September 12-19, 2011: An Update about ALICE Computing



  • ALICE@NEC
    NEC’2001: AliRoot for simulation
    NEC’2003: Reconstruction with AliRoot
    NEC’2005: AliRoot for analysis
    NEC’2007: Still no LHC data => status and plans of the ALICE offline software; calibration & alignment
    NEC’2009: In preparation for the first LHC data, no presentation
    NEC’2011: Almost 2 years of stable data taking, a lot of published physics results => an update about ALICE computing

  • Why HI collisions?
    Indication of a transition from hadron gas to QGP at Tc ≈ 170 MeV, εc ≈ 1 GeV/fm³
    Phase transition or crossover?
    Intermediate phase of strongly interacting QGP?
    Chiral symmetry restoration? Constituent mass -> current mass
    Study QCD at its natural energy scale T = ΛQCD ≈ 200 MeV by creating a state of matter at high density and temperature using highly energetic heavy-ion collisions.

    ALICE

  • History of High-Energy A+B Beams
    BNL-AGS (mid 80s, early 90s): O+A, Si+A at 15 AGeV/c, √sNN ~ 6 GeV; Au+Au at 11 AGeV/c, √sNN ~ 5 GeV
    CERN-SPS (mid 80s, 90s): O+A, S+A at 200 AGeV/c, √sNN ~ 20 GeV; Pb+A at 160 AGeV/c, √sNN ~ 17 GeV
    BNL-RHIC (early 00s): Au+Au at √sNN ~ 130 GeV; p+p, d+Au at √sNN ~ 200 GeV
    LHC (2010!): Pb+Pb at √sNN ~ 5,500 GeV (2,760 in 2010-12); p+p at √s ~ 14,000 GeV (7,000 in 2010-12)

    Trigger and data flow:
    Level 0: special hardware, 8 kHz (160 GB/sec)
    Level 1: embedded processors
    Level 2: PCs, 200 Hz (4 GB/sec); 30 Hz (2.5 GB/sec); 30 Hz (1.25 GB/sec) to data recording & offline processing
    Detector: total weight 10,000 t, overall diameter 16.00 m, overall length 25 m, magnetic field 0.5 Tesla
    ALICE Collaboration: ~1/2 ATLAS, CMS, ~2x LHCb; ~1000 people, 30 countries, ~80 institutes
    A full pp programme; the data rate for pp is 100 Hz @ 1 MB

  • Organization
    Core Offline is a CERN responsibility: framework development, coordination activities, documentation, integration, testing & release, resource planning
    Each sub-detector is responsible for its own offline system
    It must comply with the general ALICE Computing Policy as defined by the Computing Board
    It must integrate into the AliRoot framework
    http://aliweb.cern.ch/Offline/

  • Planning
    "In preparing for battle I always found plans useless but planning essential." Gen. D. Eisenhower
    (155 open items, 3266 total)

  • Resources
    Sore point for ALICE computing

  • Computing model: pp
    [diagram: Pass 1 & 2 reconstruction]

  • Computing model: AA
    [diagram: HI data taking, LHC shutdown, Pass 1 & 2 reconstruction]

  • Prompt reconstruction
    Based on PROOF (TSelector)
    Very useful for high-level QA and debugging
    Integrated in the AliEVE event display
    Full Offline code sampling events directly from DAQ memory
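    Since the prompt reconstruction runs as a PROOF TSelector, the skeleton below sketches the structure such a selector follows; the class name and the QA histogram are illustrative placeholders, not the actual AliRoot prompt-reconstruction code.

      // Minimal TSelector skeleton (illustrative; the real prompt-reco selectors live in AliRoot)
      #include <TSelector.h>
      #include <TTree.h>
      #include <TH1F.h>

      class PromptQASelector : public TSelector {
      public:
         TTree *fChain;   // tree or chain being processed
         TH1F  *fHist;    // example QA histogram

         PromptQASelector() : fChain(0), fHist(0) {}
         virtual void   Init(TTree *tree) { fChain = tree; }
         virtual void   SlaveBegin(TTree *) {
            fHist = new TH1F("hQA", "example QA quantity;x;entries", 100, 0., 100.);
            fOutput->Add(fHist);                      // merged automatically by PROOF
         }
         virtual Bool_t Process(Long64_t entry) {
            fChain->GetTree()->GetEntry(entry);       // read the current event
            // compute and fill QA quantities here, e.g. fHist->Fill(x);
            return kTRUE;
         }
         virtual void   Terminate() { /* draw or store the merged histograms */ }

         ClassDef(PromptQASelector, 0);
      };

    Such a selector would typically be launched with chain->Process("PromptQASelector.C+") locally, or with the same selector on a PROOF cluster.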

  • Visualization
    [event display: V0 candidate]

  • ALICE Analysis Basic Concepts
    Analysis models:
    Prompt data processing (calib, align, reco, analysis) @ CERN with PROOF
    Batch analysis using GRID infrastructure
    Local analysis
    Interactive analysis PROOF+GRID
    User interface:
    Access GRID via AliEn or ROOT UIs
    PROOF/ROOT: enabling technology for (C)AF
    GRID API class TAliEn
    Analysis Object Data (AOD) contain only the data needed for a particular analysis
    Extensible with ∆-AODs
    Same user code local, on CAF and on the Grid
    Work on the distributed infrastructure has been done by the ARDA project
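    As an illustration of the ROOT-side Grid UI mentioned above, the snippet below connects to AliEn through the TGrid/TAliEn interface and queries the file catalogue; the catalogue directory and file pattern are made-up examples.

      // Sketch: browsing the AliEn File Catalogue from a ROOT session
      // (the catalogue directory and file pattern are made-up examples)
      #include <TGrid.h>
      #include <TGridResult.h>
      #include <TString.h>

      void BrowseCatalogue()
      {
         TGrid::Connect("alien://");                  // authenticate and set the gGrid pointer
         if (!gGrid) return;

         TGridResult *res = gGrid->Query("/alice/data/2011/LHC11h", "*/AliESDs.root");
         if (!res) return;
         for (Int_t i = 0; i < res->GetEntries(); i++)
            Printf("lfn: %s", res->GetKey(i, "lfn"));
         delete res;
      }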

  • Analysis train
    AOD production is organized in a train of tasks
    To maximize efficiency of full dataset processing
    To optimize CPU/IO
    Using the analysis framework
    Needs monitoring of memory consumption and individual tasks
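    A minimal sketch of how such a train could be wired up with the AliRoot analysis framework, assuming the usual AliAnalysisManager / AliAnalysisTaskSE pattern; the wagon AddTask macros and file names are placeholders, not the actual production train configuration.

      // Sketch of an analysis train with the AliRoot analysis framework
      // (the wagon AddTask macros are placeholders for the real ones)
      void RunTrain(TChain *esdChain)
      {
         AliAnalysisManager *mgr = new AliAnalysisManager("AODtrain");

         // one shared ESD input handler and one AOD output handler for the whole train
         mgr->SetInputEventHandler(new AliESDInputHandler());
         AliAODHandler *aodHandler = new AliAODHandler();
         aodHandler->SetOutputFileName("AliAOD.root");
         mgr->SetOutputEventHandler(aodHandler);

         // each wagon is an AliAnalysisTaskSE attached by its AddTask macro
         gROOT->LoadMacro("AddTaskWagon1.C");         // placeholder wagon
         AddTaskWagon1();
         gROOT->LoadMacro("AddTaskWagon2.C");         // placeholder wagon
         AddTaskWagon2();

         // the same user code runs locally, on CAF ("proof") or on the Grid ("grid")
         if (mgr->InitAnalysis()) {
            mgr->PrintStatus();
            mgr->StartAnalysis("local", esdChain);
         }
      }

    The shared event loop is what gives the train its CPU/IO advantage: every wagon sees each event once, instead of each task re-reading the full dataset.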

  • Analysis on the Grid

  • Production of RAW
    Successful despite rapidly changing conditions in the code and detector operation
    74 major cycles
    7.2×10⁹ events (RAW) passed through the reconstruction
    Processed 3.6 PB of data
    Produced 0.37 TB of ESDs and other data

  • Sending jobs to data
    ALICE Job Catalogue and ALICE File Catalogue (the splitting is sketched after this list)

    Jobs as submitted: Job 1 {lfn1, lfn2, lfn3, lfn4}; Job 2 {lfn1, lfn2, lfn3, lfn4}; Job 3 {lfn1, lfn2, lfn3}

    File Catalogue entries: lfn -> guid -> {SEs} (one entry per file)

    Sub-jobs after splitting by data location: Job 1.1 {lfn1}; Job 1.2 {lfn2}; Job 1.3 {lfn3, lfn4}; Job 2.1 {lfn1, lfn3}; Job 2.2 {lfn2, lfn4}; Job 3.1 {lfn1, lfn3}; Job 3.2 {lfn2}
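    A minimal sketch of the idea behind this splitting, assuming each LFN resolves in the File Catalogue to the storage element holding a replica: the job's input files are grouped by location and one sub-job is created per group. This is illustrative standalone C++, not the actual AliEn splitting code.

      // Illustrative only: group a job's input LFNs by storage element to form sub-jobs
      #include <iostream>
      #include <map>
      #include <string>
      #include <vector>

      int main()
      {
         // hypothetical catalogue lookup: lfn -> storage element holding a replica
         std::map<std::string, std::string> replicaSE = {
            {"lfn1", "SE_A"}, {"lfn2", "SE_B"}, {"lfn3", "SE_A"}, {"lfn4", "SE_A"}};

         std::vector<std::string> jobInput = {"lfn1", "lfn2", "lfn3", "lfn4"};

         // one sub-job per storage element, each processing only the files local to it
         std::map<std::string, std::vector<std::string> > subJobs;
         for (const std::string &lfn : jobInput)
            subJobs[replicaSE[lfn]].push_back(lfn);

         for (const auto &sj : subJobs) {
            std::cout << "sub-job at " << sj.first << ":";
            for (const std::string &lfn : sj.second) std::cout << " " << lfn;
            std::cout << "\n";
         }
         return 0;
      }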

  • Storage strategy
    [diagram: disk storage behind SRM interfaces at the sites]

  • The access to the data
    Application: direct access to data via the TAliEn/TGrid interface (a minimal example follows below)

    Event tags: ev# -> guid -> {tag1, tag2, tag3} (one entry per event)
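    To make the direct-access point concrete, the snippet below opens a catalogue entry straight from a ROOT application through the alien:// protocol; the LFN shown is a made-up example.

      // Sketch: direct data access from a ROOT application (the LFN is a placeholder)
      #include <TFile.h>
      #include <TGrid.h>
      #include <TTree.h>
      #include <TString.h>

      void OpenFromCatalogue()
      {
         TGrid::Connect("alien://");                  // needed once per session
         TFile *f = TFile::Open("alien:///alice/data/2011/LHC11h/000123456/ESDs/pass1/AliESDs.root");
         if (f && !f->IsZombie()) {
            TTree *esdTree = (TTree *)f->Get("esdTree");
            if (esdTree) Printf("events in file: %lld", esdTree->GetEntries());
            f->Close();
         }
      }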

  • The ALICE way with XROOTD
    Pure xrootd + ALICE strong authorization plugin; no difference between T1/T2 (only size and QoS)
    WAN-wide globalized deployment, very efficient direct data access
    Tier-0: CASTOR+Xrd serving data normally
    Tier-1: pure xrootd cluster serving conditions to ALL the Grid jobs via WAN
    Old DPM+xrootd in some Tier-2s
    The xrootd sites (GSI, CERN, any other) each run cmsd+xrootd and form a globalized cluster under the ALICE global redirector: a virtual mass-storage system built on data globalization
    Local clients work normally at each site; a smart client could point directly at the global redirector
    Missing a file? Ask the global redirector, get redirected to the right collaborating cluster, and fetch it. Immediately.
    More details and complete info in "Scalla/Xrootd WAN globalization tools: where we are" @ CHEP09
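    From the client side, globalization simply means a file can be opened against the global redirector, which redirects the request to whichever collaborating cluster holds it; the redirector hostname and path below are placeholders, not the real ALICE endpoint.

      // Sketch: opening a file through a global xrootd redirector (hostname and path are placeholders)
      #include <TFile.h>
      #include <TString.h>

      void OpenViaRedirector()
      {
         TFile *f = TFile::Open("root://global-redirector.example.org//alice/some/file.root");
         if (f && !f->IsZombie())
            Printf("opened after redirection: %s", f->GetName());
      }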

  • CAF
    The whole CAF becomes an xrootd cluster
    Powerful and fast machinery, very popular with users
    Allows for any use pattern, however quite often leading to contention for resources
    [plot: expected vs. observed speedup, ~70% utilization]
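    For context, using CAF from a user session boils down to opening a PROOF connection and processing a registered dataset with a selector; the master hostname and dataset name below are placeholders.

      // Sketch: running a selector on CAF through PROOF (master and dataset names are placeholders)
      #include <TProof.h>

      void RunOnCAF()
      {
         TProof *p = TProof::Open("alice-caf.example.org");       // placeholder PROOF master
         if (!p) return;
         p->Process("/default/myuser/LHC11h_sample",              // registered dataset (placeholder)
                    "PromptQASelector.C+");                       // selector from the earlier sketch
      }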


  • Analysis facilities: profile
    1.8 PB of data through CAF, 550 TB through SKAF
    For comparison, on the Grid we have written 15 PB and read 37 PB

  • The ALICE Grid
    AliEn working prototype in 2002
    Single interface to distributed computing for all ALICE physicists
    File catalogue, job submission and control, software management, user analysis
    ~80 participating sites now:
    1 T0 (CERN/Switzerland)
    6 T1s (France, Germany, Italy, The Netherlands, Nordic DataGrid Facility, UK); KISTI and UNAM coming (!)
    ~73 T2s spread over 4 continents
    ~30,000 (out of ~150,000 WLCG) cores and 8.5 PB of disk
    Resources are pooled together: no localization of roles / functions
    National resources must integrate seamlessly into the global grid to be accounted for
    FAs contribute proportionally to the number of PhDs (M&O-A share)
    T3s have the same role as T2s, even if they do not sign the MoU
    http://alien.cern.ch
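    Job submission and control go through the same AliEn interface; a hedged sketch, assuming a JDL describing the executable, input data and splitting is already in the catalogue (the JDL path is a placeholder):

      // Sketch: submitting a Grid job through the AliEn API (the JDL path is a placeholder)
      #include <TGrid.h>
      #include <TGridJob.h>
      #include <TString.h>

      void SubmitJob()
      {
         TGrid::Connect("alien://");
         if (!gGrid) return;
         // the JDL (in the catalogue) describes the executable, input data collection and splitting
         TGridJob *job = gGrid->Submit("/alice/cern.ch/user/m/myuser/myAnalysis.jdl");
         Printf(job ? "job submitted" : "submission failed");
      }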

  • All is in MonALISA

  • GRID operation principle
    Central AliEn services talk to a VO-box at each site
    The VO-box interfaces to the local WMS (gLite/ARC/OSG/local) and SM (dCache/DPM/CASTOR/xrootd), and runs monitoring and package management
    The VO-box system (very controversial in the beginning):
    Has been extensively tested
    Allows for site services scaling
    Is a simple isolation layer for the VO in case of troubles

  • Operation: central/site support
    Central services support (2 FTEs equivalent)
    There are no experts doing exclusively support; there are 6 highly qualified experts doing development/support
    Site services support is handled by regional experts (one per country) in collaboration with local cluster administrators
    Extremely important part of the system
    In normal operation ~0.2 FTEs/site
    Regular weekly discussions and active all-activities mailing lists

  • Summary
    The ALICE offline framework (AliRoot) is a mature project that covers simulation, reconstruction, calibration, alignment, visualization and analysis
    Successful operation with real data since 2009
    The results for several major physics conferences were obtained in time
    The Grid and AF resources are adequate to serve the RAW/MC and user analysis tasks; more resources would be better, of course
    The sites operation is very stable
    The gLite (now EMI) software is mature and few changes are necessary

  • Some Philosophy

  • The code
    The move to C++ was probably inevitable, but it caused a lot of collateral damage
    The learning process was long, and it is still going on
    Very difficult to judge what would have happened had ROOT not been there
    The most difficult question now is what comes next
    A new language? There is none on the horizon
    Different languages for different scopes (Python, Java, C, CUDA)? Just think about debugging
    A better discipline in using C++ (in ALICE: no STL / templates)
    Code management tools, build systems, (c)make, autotools: still a lot of glue has to be provided, no comprehensive system out of the box

  • The Grid
    The half-empty glass:
    We are still far from the Vision
    A lot of tinkering and hand-holding to keep it alive
    4+1 solutions for each problem
    We are just now seeing some light at the end of the tunnel of data management
    The half-full glass:
    We are using the Grid as a distributed heterogeneous collection of high-end resources, which was the idea after all
    LHC physics is being produced by the Grid

  • Grid need-to-have
    Far more automation and resilience: make the Grid less manpower-intensive
    More integration between workload management and data placement
    Better control of upgrades (OS, MW), or better transparent integration of different OS/MW
    Integration of the network as an active, provisionable resource
    Close storage element, file replication / caching vs. remote access
    Better monitoring, or perhaps simply more coherent monitoring...
