
Report on HEPiX Meeting Spring ‘10
A short personal view

Thomas Finnern (DESY/IT)

HEPiX Spring 2010

Lisbon, Portugal


“Less ash for your cloud”

> Most Americans in Lisbon

> Most Europeans on EVO

  • ~60-65 connections on average


What is HEPiX?

> A global organisation since 1991

> Unites IT system support staff, including system administrators, system engineers, and managers from the High Energy Physics (HEP) and Nuclear Physics laboratories and institutes

> BNL, CERN, DESY, FNAL, IN2P3, INFN, JLAB, NIKHEF, RAL, SLAC, TRIUMF and many others

> Semi-annual meetings are an excellent source of information for IT specialists in scientific computing

> http://www.hepix.org


HEPiX Meeting Outline

> Names and Numbers
  • LIP: Laboratório de Instrumentação e Física Experimental de Partículas
  • 105 participants (6 from DESY)
  • 90 from European countries, 15 from other countries
  • Monday to Friday

> Travel crisis due to the Icelandic volcano Eyjafjallajökull [‘ɛɪja,fjatl̥a,jœkʏtl̥]

> Daily Keynotes
  • Convener: Goncalo Borges
  • Portuguese infrastructure: Ibergrid, Grid Europe – South America
  • CPU-GPU clusters
  • Management Information Systems

> Site Reports
  • Convener: Michel Jouvin (LAL/IN2P3/GRIF)

> Technical Topics (Tracks)
  • Virtualisation: Convener: Tony Cass (CERN)
  • Operating Systems & Applications: Convener: Sandy Philpott (JLAB)
  • Monitoring & Infrastructure Tools: Convener: Helge Meinhard (CERN)
  • Storage and Filesystems: Convener: Andrei Maslennikov (CASPUR)
  • Grid and WLCG: Convener: Goncalo Borges (LIP)
  • Security & Networking: Convener: David Kelsey (RAL)
  • Benchmarking

> Virtualization Working Group F2F
  • See: extra report

> DESY Talks (in order of appearance)
  • Evaluation of NFS v4.1 (pNFS) with dCache (FUHRMANN, Patrick)
  • Building up a high performance data centre with commodity hardware (HAUPT, Andreas)
  • DESY site report (FRIEBEL, Wolfgang)
  • Virtual Network and Web Services (An Update) (FINNERN, Thomas)


Site Reports (Michel Jouvin)

> RAL Site Report: BLY, Martin
  • Admin & science support, SCARF (HPC cluster), UPS, chillers & pumps, 2.25 MW, batch with 3 GB/core

> BNL RHIC/ATLAS Computing Facility Site Report: HOLLOWELL, Christopher
  • Near NY; 100 Thumper/Thor (Solaris 10), 250 SL infrastructure servers

> CERN Site Report: Dr. MEINHARD, Helge
  • LHC status
  • CERN IT reorganisation
  • ITIL implementation ongoing, 100 kW missing, big procurement, Windows 7
  • Solaris and SPARC phased out

> LAL/GRIF Site Report: JOUVIN, Michel

> Jefferson Lab Site Report: PHILPOTT, Sandy
  • IPv6
  • CEBAF upgrade project
  • HPC InfiniBand clusters
  • HPC GPU cluster: 63 nodes, 200 GPUs
  • 200 TB Lustre 1.8.2
  • Fedora 32-bit -> CentOS 5.3 64-bit

> Site report from PDSF: SRINIVASAN, Jay(?)

> KIT Site Report: ALEF, Manfred(?)

> CSC Site Report: HAKALA, Tero

> PIC Site Report: MARTINEZ RAMIREZ, Francisco

> SLAC Site Report: MELEN, Randy
  • Change from HEP to photon science: process, plus enterprise architect, …
  • 8200 batch cores (LSF), subclusters
  • InfiniBand, GPU (CUDA, OpenCL)

> DESY Site Report: FRIEBEL, Wolfgang
  • 1a: new directors, buildings

> PSI Site Report: Dr. FEICHTINGER, Derek
  • Bell

> GSI Site Report: Mr. HUHN, Christopher

> Petersburg Nuclear Physics Institute (PNPI) Status Report: SHEVEL, Andrey
  • Small site -> small cluster -> small support? -> cloud gateway at T3 level?

> Fermilab Site Report: CHADWICK, Keith
  • FermiCloud system in GCC; IaaS; procurement now
  • Nehalem with hyperthreading
  • ITIL transition, e.g. change management
  • Power outage at the Feynman Computing Center: downtime 2-4 hours
  • A second break 12 days later, caused by a 1400 A 3-phase power breaker
  • -> running at 75% … +++
  • Lessons learned: trust, communication, HA -> UA, network rescue

> The ATLAS Great Lakes Tier-2: Status and Plans: Dr. MC KEE, Shawn

> The Portuguese WLCG Tier2: status and issues: DAVID, Mario

> INFN Tier1 Site Report: SAPUNENKO, Vladimir
  • 1400 -> 2000 virtual machines
  • Castor -> GEMSS (StoRM, GPFS, TSM, GridFTP)

> Prague Tier-2 Site Report: SVEC, Jan


Virtualisation (Tony Cass)

> Update on HEPiX Working Group on Virtualisation: CASS, Tony

  • See: extra report

> Virtualization at CERN: A Status Report: SCHWICKERATH, Ulrich
  • PES
  • Batch: 3 different VM layouts
  • ISF and/or OpenNebula
  • “Golden nodes” as VM reference
  • Quattor/Lemon
  • 1 Gb network; compressed images distributed via scp/rtorrent
  • Large scale still under test
  • VMs: 24 h lifetime only
  • No direct batch control of the created image

> Virtual Machines over PBS: RODRIGUEZ ESPADAMALA, Marc
  • PIC
  • VM startup within the prolog (see the libvirt sketch below)
  • KVM snapshots
  • Needs/wants some add-ons in PBS for better VM support
  • DIRAC pilot jobs
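As an illustration of the prolog approach (my sketch, not PIC's actual implementation): a PBS prolog could ask the local KVM hypervisor to boot a per-job guest via the libvirt Python bindings. The connection URI and domain name below are assumptions for the example.

    # Hypothetical PBS prolog helper: boot a pre-defined KVM guest via libvirt.
    # The URI and domain name are illustrative; error handling is minimal.
    import sys
    import libvirt

    def start_job_vm(domain_name: str) -> int:
        conn = libvirt.open("qemu:///system")      # local KVM/QEMU hypervisor
        try:
            dom = conn.lookupByName(domain_name)   # domain defined beforehand
            if not dom.isActive():
                dom.create()                       # boot the guest
            return 0
        except libvirt.libvirtError as err:
            print(f"prolog: cannot start {domain_name}: {err}", file=sys.stderr)
            return 1
        finally:
            conn.close()

    if __name__ == "__main__":
        sys.exit(start_job_vm(sys.argv[1]))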

> An Adaptive Batch Environment for Clouds: GABLE, Ian
  • HEP Legacy Data Project (BaBar), CANFAR (astrophysics): lots of individual environments
  • Nimbus Context Broker
  • SGE to Amazon EC2
  • (Eucalyptus)
  • -> Cloud Scheduler: combines the above without copying them
  • Early experiences
  • cloudscheduler.org

> CERN Virtual Infrastructure: VAN ELDIK, Jan

> Virtual Network and Web Services (An Update): FINNERN, Thomas

  • F5 load balancer
  • DESY web site
  • Infoscreen

> Virtualisation for Oracle Databases and Application Servers: GARCIA FERNANDEZ, Carlos
  • Number of instances increasing
  • Performance: minus a few percent
  • Live migration
  • JRockit: application directly on the hypervisor
  • Migration from physical to virtual
  • Quattor based
  • NFS based: /OVS -> /var/mount/ovs/<uuid>
  • “On the fly” and “golden images”


Operating Systems & Applications (Sandy Philpott)

> Scientific Linux Status Report and Plenary Discussion: DAWSON, Troy (FERMI)
  • Linux is like a wigwam: no Gates, no Windows, and Apache inside
  • "We all know Linux is great... it does infinite loops in 5 seconds." --Linus
  • SL5 usage increasing
  • SL5.4 released November 2009, incl. LiveCD
  • Work in progress:
    - SL4.9: 2010?
    - SL5.5 in beta -> June 2010
    - Newer, faster, better distro servers
    - SL6: Koji build automation; pre-alpha after SL5.5; SL6.0 February 2011?
    - SL3.0.9: end of October 2010
    - SL4 goes into legacy mode
  • Spacewalk -> delta e-mails?
  • Discussion:
    - If CentOS: run in parallel or not?
    - Extra RPMs without regular security updates in between SL release cycles
    - No extra kernel maintainable by Troy
  • Discussion: getting started is difficult: a couple of months (documentation, personnel)

> Current Windows 7 Status @ CERN: BUDZOWSKI, Michal
  • Coming from Vista, but 6000 managed PCs (mostly XP)
  • Already 430 installations
  • Default: 32-bit
  • Requirements, 32-bit (64-bit):
                 Microsoft       CERN recommendation
    Memory       1 (2) GB        2 (4) GB
    Disk         16 (20) GB      60 (60) GB
    CPU          1 GHz           2 GHz
  • Legacy: old hardware gone by 2012
  • Add-ons/changes:
    - Password-protected screensaver
    - Recently changed to search folders
    - International: language English (UK), location Switzerland, keyboard US, Euro
  • 2010 Q2: Windows 7 as default
  • 2010 Q3: phase out Vista
  • 2010 Q4: roadmap for XP

> TWiki at CERN: Past, Present and Future: JONES, Pete
  • A wiki is a web page with an edit button
  • The simplest online database possible
  • Alternatives: Confluence, MediaWiki, PhpWiki, TikiWiki, TWiki
  • TWiki: open source since 1998; Perl based: Linux + Apache
  • Since 2003 TWiki has been upgraded several times
  • March 2010: backend migrated from AFS -> NFS
  • 7500 registered users (CMS, ATLAS, …)
  • 190 collaboration webs
  • 60,000 topics
  • 3,000,000 accesses/month
  • 50,000 updates/month
  • No anonymous write
  • Access control: by username or group and e-group
  • ENV(HTTP_ADFS_GROUP)
  • Issues:
    - Performance, SSO code changes, web management
    - Google (search) does not see protected pages
    - CERN search will soon also cover protected data
    - Now TWiki.net instead of open source …
  • Complements other IT services


Monitoring & Infrastructure tools (Helge Meinhard)

> Spacewalk and Koji: DAWSON, Troy
  • Spacewalk: install / maintain “channels”
  • At Fermi, testing: group machines / set channels / separate users / …
  • Open-source Red Hat Satellite system
  • Koji for building the (Red Hat/Fedora) distro (and SL6?)
  • Koji -> mock -> … -> RPMs
  • Mash + Bodhi

> RAL Tier1 Quattor Experience and Quattor Outlook: COLLIER, Ian Peter
  • Introduction to the Quattor toolkit
  • Profile/version control …
  • Started with SL5: 680 servers + 130 Castor servers
  • The bigger the site, the more Quattor usage
  • Discussion: heterogeneous hardware is difficult -> separate hardware and payload definitions

> Lavoisier: A Way to Integrate Heterogeneous Monitoring Systems: L'ORPHELIN, Cyril

> Scientific Computing: First Quantitative Methodologies for a Production Environment: CIAMPA, Alberto
  • Cost evaluation: TCO? ROI?

> Lessons Learned from a Site-Wide Power Outage: BARTELT, John
  • SLAC: started 19 January, lasted into 20 January
  • Payroll/printing, light, coffee, communication, priorities, …
  • Documentation of dependencies


Storage and Filesystems (Andrei Maslennikov)

> CERN Lustre Evaluation and Storage Outlook: BELL, Tim
  • 1.7 beta, HSM; analysis space, project space, user home directories
  • Mandatory: strong authentication almost OK; no live data migration; backup OK; HA/redundancy OK; small files almost OK; HSM under development; no replication; no privilege delegation; no strong admin control
  • Additional: client/server coupling too strong (versions, kernel, etc.)
  • No Lustre at CERN for T0, analysis, or AFS replacement; interest in the roadmap being fulfilled
  • Big storage: on-disk and tape backup (write once, read never)

> High Performance Storage for LHC: Dr. DUELLMANN, Dirk

> LCLS Data Analysis Facility: WACHSMANN, Alf
  • Linac Coherent Light Source
  • Also with CFEL (DESY)
  • Lustre online + offline

> GEMSS: Grid Enabled Mass Storage System for LHC experiments: SAPUNENKO, Vladimir

  • CNAF (Italy)
  • StoRM/GPFS/TSM

> OpenAFS Performance Improvements: Linux Cache Manager and Rx RPC Library: ALTMAN, Jeffrey; WILKINSON, Simon
  • Disk cache benefits over memory cache
  • Page cache improvements
  • Minimize data copies
  • OpenAFS roadmap (ALTMAN):
    - 1.5: Windows production
    - 1.6: Summer 2010; source code quality, Mac OS X, …, NFS -> AFS translator, Solaris 11
    - 1.8: …, krb5, GSS, X.509, SCRAM, …
    - 2.0: Summer 2011
    - 2.2

> CC IN2P3: A Way to Combine Heterogeneous Monitoring Systems
  • Lavoisier: a data source composition service

> Progress Report 2010 for HEPiX Storage Working Group: MASLENNIKOV, Andrei
  • Test facility @ KIT: high-end servers + latest versions; CMS and ATLAS tests
  • Questionnaire: 87 PB: 33% Castor, 33% dCache; N-client/N-server = 10 for a 1 Gb server; AFS/Lustre, GPFS, dCache

> Evaluation of NFS v4.1 (pNFS) with dCache: FUHRMANN, Patrick
  • NFS 4.1 of pre-production quality; supported by the golden release; set_acl by user soon
  • Kernel 2.6.32 is the first to support NFS 4.1 (SL6+); local results stable and fast

> Building up a High Performance Data Centre with Commodity Hardware: HAUPT, Andreas
  • Lustre, multiple batch clusters (batch, parallel, NAF), DELL, 30 cents/GB

> Lustre-HSM Binding: LEIBOVICI, Thomas
  • CEA (France)
  • Generic HSM backend (not only HPSS; POSIX, …)
  • Oracle/Sun/CFS/CEA/others
  • V1 features: migrate data, free space, recover, policies, import, disaster recovery
  • RobinHood as policy engine
  • V2 features in progress: fine tuning …


Grid and WLCG (Goncalo Borges)

> The New Generations of CPU-GPU Clusters for e-Science: PROENÇA, Alberto
  • Paradigm change: computers are not becoming faster (per core)
  • Complex SMP + MPP systems
  • Single programs: messaging, load balancing, …
  • New chips with message passing between cores
  • Data parallelism: from SIMD (single instruction, multiple data) to GPUs (MIMD)
  • CUDA (Compute Unified Device Architecture): serial CPU code + parallel GPU code (see the sketch below)
  • NVIDIA GPUs: G80, GT200, Fermi (512 cores)
  • Big installations: CSIRO (AUS), NCSA (USA)
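To make the serial-CPU-plus-parallel-GPU split concrete, here is a minimal sketch (mine, not from the talk) of the same SAXPY-style update written first as a serial loop and then in data-parallel form with numpy; a CUDA kernel expresses the second form, launching one GPU thread per array element.

    # Minimal sketch: the same elementwise update, serial vs. data-parallel.
    # On a GPU, the data-parallel form is what a CUDA kernel expresses.
    import numpy as np

    n = 100_000
    a = 2.0
    x = np.random.rand(n)
    y = np.random.rand(n)

    # Serial CPU style: one element at a time.
    y_serial = y.copy()
    for i in range(n):
        y_serial[i] = a * x[i] + y_serial[i]

    # Data-parallel style: one operation over all elements at once.
    y_parallel = a * x + y

    assert np.allclose(y_serial, y_parallel)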

> WLCG - Evolving for the Future: Dr. BIRD, Ian
  • T2/T3 discussion

> CESGA Experience with the Grid Engine Batch System: FREIRE GARCIA, Esteban
  • Oracle Grid Engine
  • With ARCo
  • Large=400
  • Interconnect: builtin + ssh
  • AllowUser=Admins

> CERN Grid Data Management Middleware Plan for 2010: KEEBLE, Oliver
  • Manageability, performance, standards (and therefore interoperability): SSL, NFS 4.1, HTTP/HTTPS
  • Full CERN support: FTS, DPM/LFC, gfal/lcg_util

> EGEE Site Deployment: The UMinho-CP Case Study: SÁ, Tiago (UMinho)
  • EGEE is a dynamic organism whose requirements constantly evolve; deploying UMinho-CP, an EGEE site supporting civil-protection activities, revealed new challenges …
  • Deployment: Rocks toolkit, not Quattor


Security & Networking (David Kelsey)

> Update on Computer Security: Dr. SCHWICKERATH, Ulrich

> IPv6 in HEP - a discussion: Dr. KELSEY, David


Benchmarking

> Preliminary Measurements of HEP-SPEC06 on the New Multicore Processors: MICHELOTTO, Michele (see the scoring sketch below)
  • Intel: Nehalem -> Westmere: 4 cores (8 logical) to 6 cores (12 logical CPUs)
  • AMD: Istanbul -> Magny-Cours: 6 to 12 cores
  • Compilers ready?
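For context (my sketch, not a result from the talk): HEP-SPEC06 is derived from the SPEC CPU2006 suite, and SPEC-style composite scores aggregate per-benchmark performance ratios with a geometric mean, so comparing such machines boils down to something like the following; the benchmark names and timings are placeholders.

    # SPEC-style scoring sketch: the composite score is the geometric mean of
    # per-benchmark ratios (reference time / measured time). All names and
    # numbers below are made-up placeholders.
    from math import prod

    reference_times = {"bench_a": 1000.0, "bench_b": 2000.0, "bench_c": 1500.0}
    measured_times  = {"bench_a":   90.0, "bench_b":  170.0, "bench_c":  140.0}

    ratios = [reference_times[b] / measured_times[b] for b in reference_times]
    score = prod(ratios) ** (1.0 / len(ratios))    # geometric mean
    print(f"composite score: {score:.1f}")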

> Hyperthreading Influence on CPU Performance: MARTINS, Joao
  • +20% without I/O
  • +30% with light I/O
  • Advantage is application specific
  • Default OS CPU affinity not optimal for HT (see the affinity sketch below)
  • Recommendation: no HT for now
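To illustrate the affinity remark (my sketch, not from the talk): on Linux a process can be pinned to one logical CPU per physical core, so two busy processes do not land on hyperthread siblings of the same core. The CPU ids below are placeholder assumptions; the real sibling layout is machine specific (see physical id / core id in /proc/cpuinfo).

    # Pin the current process to one logical CPU per physical core (Linux only).
    # CPU ids are placeholders: here we assume cpus 0-3 are distinct physical
    # cores and cpus 4-7 are their hyperthread siblings.
    import os

    physical_cores = {0, 1, 2, 3}
    os.sched_setaffinity(0, physical_cores)        # 0 = current process
    print("allowed CPUs:", sorted(os.sched_getaffinity(0)))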


Summary

> State of the Art Virtualisation

> More Data

> More “Green Computing”

> More Consolidation

> ITIL (IT Infrastructure Library) is still coming (slowly)

> In fact the most challenging HEPiX meeting to organize!

  • Everything reorganized over the weekend

> On the good side: the EVO experience was a success!
  • Most of the registered people connected during the 5 days
  • ~60-65 connections on average
  • Smooth meeting despite being remote
  • Need to add coffee breaks and dinner to EVO!!!

> HEPiX continues to attract new sites and new people


Next Fall meeting (2010): Cornell University

> Ithaca, NY (south of Lake Ontario)

> http://maps.google.fr/maps?client=opera&rls=fr&q=ithaca+new+york&sourceid=opera&oe=utf-8&um=1&ie=UTF-8&hq=&hnear=Ithaca,+NY,+USA&gl=fr&ei=6PrqSvypBoPclAfJqfj_BA&sa=X&oi=geocode_result&ct=image&resnum=1&ved=0CAsQ8gEwAA

> 1st week of November (Nov. 1-5)

> Web site available by end of May