1 Where do we go from here to “there”? A possible view of tomorrow’s computing world in HENP New trends and paradigms in HENP: where do we go from here

1


New trends and paradigms in HENP: where do we go from here to “there”?
Jérôme Lauret
RHIC/STAR Software & Computing Leader
Brookhaven National Laboratory, NY, USA

2


Outline
• HENP successes
• New paradigms: a look ahead
  • Cloud computing
    • From Grid to Cloud
    • Usage & the future
  • Multi-core & Many-core
    • Evolution
    • How did we get there?
    • Can we use it?
  • Examples of multi-core problematic
  • More trends …
• Concluding remarks

3


HENP success
Building a global infrastructure – Grids
Sustain unprecedented data productions
Build services to significant scales

4


Grids – LCG, …

1 M jobs/day

http://lcg.web.cern.ch/lcg/

5


Grids – OSG, …

http://www.opensciencegrid.org/

6


Key to success
• Grid middleware, Grid schedulers, …
• Ethernet [1 and 10 Gbit/s; LAN and WAN; IPv4]
• MSS - High-density tapes and efficient access to low latency (high capacity) storage
• Data distribution access layers: Castor, dCache, xrootd, … pNFS
• ROOT files (physics data)
• Relational databases (metadata, all kinds of SQL)
• AFS, NAS, RAID
• Batch systems
• C++ frameworks – code re-usability, … base-class, steering class
• gcc / Linux

• With lots of data … data deluge …

7


STAR HPSS raw data
High Performance Storage System

4.7 M files, 2.4 PBytes

http://www.hpss-collaboration.org/

8


STAR HPSS production data
~ 2.2 PBytes of “derived” data
18 M files, 15 M files indexed

STAR resources ~ 6500 kSI2000 @ BNL + remote sites

http://www.hpss-collaboration.org/

9


The Tianhe-1A, based at the National University of Defense Technology (NUDT) in China, can perform over 2.5 thousand trillion floating point operations per second (PetaFlops).

http://www.nytimes.com/2010/10/28/technology/28compute.html

10


Microsoft partners with China firm on cloud-computing

IBM helps clients excel in Cloud Computing, offering reliable and secure SaaS, IaaS, and PaaS solutions.

http://www.thetechlabs.com/latest/cloud-computing/

11


The world is changing – new trends have appeared
PetaFlops, Exascale, BlueGene, Cloud, multi-core, many-core, ARM versus Intel, Xeon vs Atom vs … IPv6, …

Changes are coming fast (some are going fast too)
Can we integrate the new realities into current frameworks?

12


New paradigms, a look ahead
Cloud & parallelism

13


Cloud computing
From Grids to Clouds

14


Cloud computing primer
• No kitchen sink in the Cloud … yet
• Base idea and building blocks

Infrastructure in layers
• Infrastructure-as-a-Service (IaaS)
  • Provides access to & control of infrastructure
  • Ex: Amazon Web Services - through an interface, ask for instances of an infrastructure (VM) and access them; start and shut down instances
• Platform-as-a-Service (PaaS)
  • Collection of software and tools maintained by your cloud provider
  • Ex: Microsoft Azure - develop apps with the Azure platform, move apps to the Cloud (get rid of local services, connect from anywhere)
• Software-as-a-Service (SaaS)
  • Both hardware and software provided on the Cloud
  • Ex: Apple iCloud - apps already provided and interacting with the Cloud (store and retrieve content, DB, iTunes, … services)

15


Use in HENP – why and what is attractive?

• Are Grids usable and why a move?
  • Pro
    • Grids are usable – outstanding use of resources. Efficiency > 97% (second try)
    • Grid operation support is worldwide
  • Cons
    • Grids are heterogeneous – you CANNOT guarantee on which platform you will run. Infrastructure is controlled/maintained by the provider. You need to “discover” resources
    • Complex and dynamic
    • Troubleshooting is very inadequate (globus error # anyone?), exacerbated by heterogeneity (OS, batch, any components, …)

• Cloud IaaS has virtualization at its heart
  • Cons
    • No operational support, no helpdesk
    • Commercial mostly (for now)
  • Pro
    • Clouds are usable – in the simplest form, efficiency > 97%
    • You can (ex.) “provision” resources by packing ALL of your software, services, OS, environment dependencies (startup and setup scripts) into one VM
    • You can test this VM at home and make sure your software stack works AND produces the same result as on your home infrastructure / cluster (validation / QA)
    • You build the VM once, deploy as many times as needed
    • … VMs can be preserved – obsolescence and experiment AOL (long term preservation) near ensured

16


Motivation for Cloud – a software stack problem analysis

• Complex experimental application codes
  • STAR case: developed over more than 10 years, by more than 100 scientists, comprises ~2.5 M lines of C++ and Fortran code
• Require complex, customized environments
  • Rely on the right combination of compiler versions and available libraries
  • Dynamically load external libraries depending on the task to be performed (system or third parties: ROOT, mysql, libxml, …)

[Figure: growth of the STAR code base in lines of code (axis 0 to 3.0E+06) from 2000/07 to 2010/02, by language: C++, ANSI-C, FORTRAN, Lex/Yacc, perl, Other (tcl, awk, python, ...), Shell (sh, csh)]

Little to NO opportunistic use of Grids
Cloud – pack once into a VM, use many times

17


Amazon EC2 (native)
General recipe
• Prepare VM (4-6 GB with OS and STAR software altogether)
  • ~ 2 hours of preparation, to be done once
• Contextualize (EC2 specifics)
• Ship it to EC2 (slow? 20 minutes, also a one-time job)
• Log in to EC2 and check that the (new) STAR VM image exists (this creates an AMI or Amazon Machine Image)
• Select the STAR VM image you want to launch
• Select type of machine & # of copies
  • Must select SSH keys if you want to use them to communicate with this VM
  • Must select firewall (part of your EC2 setup)
• The exciting part: press “launch” … do your physics
• The not so painful part … pay

18


Models – virtualization at a glance

• VOC (Sebastien Goasgen, Jérôme Lauret, Michael Fenn, Levente Hajdu): “on-demand” VMs subscribe to an external RMS. VMs form an additional network layer
• STAR @ MIT (Adam Kocoloski, Jan Balewski, Matthew Walker): purely Web based + ssh login possible. WN “see” the world
• /EC2 (Kate Keahey, Jérôme Lauret, Tim Freeman, Levente Hajdu, Lidia Didenko): Gatekeeper + WN form a virtual cluster. WN “see” the world
• Condor/VM (Miron Livny, Greg Thain, Jan Balewski, Matthew Walker, Jérôme Lauret): semi-standard GK used to start VMs. Private IP space, need SE + start/stop mechanism for VMs
• Kestrel (Sebastien Goasgen, Jérôme Lauret, Matthew Walker, Lance Stout): IM client controls VMs, XMPP used for dispatch

CHEP 2010 - contribution ID 267, SciDAC 2010 paper

http://117.103.105.177/MaKaC/contributionDisplay.py?contribId=267&sessionId=73&confId=3

http://drupal.star.bnl.gov/STAR/starnotes/public/sn0524


19


Many success stories …
• July 28th 2011 - Magellan Tackles the Mysterious Proton Spin
• June 1st 2011 - The case of the missing proton spin
• March 24th 2010 - Video of the week - RHIC’s hot quark soup
• May 29th 2009 - Nimbus cloud project saves brainiacs' bacon
• May 2nd 2009 - Number Crunching Made Easy
• April 8th 2009 - Feature - Clouds make way for STAR to shine (also as OSG highlight)
• April 2nd 2009 - Nimbus and cloud computing meet STAR production demands (see also HPCwire)
• April 30th 2008 - The new Nimbus: first steps in the clouds
• September 2007 - CHEP 2007 OSG SUMS Workspace Demo (also attached)
• August 7th 2006 - SunGrid and the STAR Experiment (flyer attached)

Magellan Project – seamless use of Nimbus, Eucalyptus and OpenStack
• What do we learn from “Cloudifying” National Laboratory Resources?
• ACAT 2011, contribution #58, Offloading peak processing to Virtual Farm by STAR experiment at RHIC
• First near real-time data processing on the Cloud in HENP that I know of
• Simple provider/consumer model of VM coordination

Overall achievements:
• Cloud processing boost for the “W” measurement (10 months became 3)
• On-the-fly BUR reshape in 2011, data and result preview

The Cloud computing paradigm is a HUGE success for STAR

http://www.nersc.gov/news-publications/science-news/2011/magellan-tackles-the-mysterious-proton-spin/

http://www.isgtw.org/feature/case-missing-proton-spin

http://www.isgtw.org/?pid=1002442

http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1357548,00.html



http://www.newsweek.com/id/195734


http://www.opensciencegrid.org/Clouds_make_way_for_STAR_to_shine

http://www.anl.gov/Media_Center/News/2009/news090402.html


http://www.hpcwire.com/offthewire/Nimbus-and-Cloud-Computing-Meet-STAR-Production-Demands-42354742.html?page=1


http://workspace.globus.org/downloads/OSGCHEP.pdf

http://drupal.star.bnl.gov/STAR/system/files/OSGCHEP.pdf

http://www.sun.com/service/sungrid/brookhaven.pdf


http://drupal.star.bnl.gov/STAR/system/files/SunGrid-brookhaven.pdf

http://indico.cern.ch/contributionDisplay.py?contribId=58&confId=93877



20


The future?
• Clouds represent an evolutionary step in reaching the dream of “heterogeneity with confidence”. In layers:
  • Software and environment fully packaged, fully reproducible
  • Infrastructure seems to scale in size and VMs
  • Resource “chunks” are carved out of larger clusters
  • HENP only grazing Cloud: IaaS, PaaS (Nimbus, …)

• Tomorrow?
  • Traditional farms / clusters will disappear – economies of scale
  • ExaScale and Mega-computers may replace them
  • Will be hard for HENP …
  • But technology WILL exist to create “virtual clusters”
  • “Infinite” power within reach from your laptop

21


Multi-core, many cores era
Many for > 100
Onset of ExaScale computing

22


Disclaimer: New paradigms? Or not so new reality?
Changes through improvements
Changes through innovation

23


Many ways to see evolution …

Can process faster, smarter but still on two feet, two arms, …

…

Computers are as “simple” as they used to be 50 years ago. Under the hood
• Still based on the Von Neumann architecture (shared bus, shared memory) with few improvements (cache)
• Push words back and forth – machine language very primitive

Bottlenecks and solutions are also the same – data in/out is a killer, CPUs need to wait, caching strategies all over for ages (Harvard architecture)

24


Multi-core, many cores
How did we get there?

• Moore’s law: a self-fulfilling prophecy
  • CPU power increases by x2 every 2 years (or 18 months)
• 2003-2005: Intel “buzz killer”
  • Speed highly related to miniaturization of components
• 2005+: new strategy begins
  • Keep speed around 2-3 GHz (best 5.2 GHz for the z196)
  • Pack more CPUs into the same box: multi-core
  • Increase “other dimensions”

[Figure labels: Raw power increase era; Multi-core era]

http://en.wikipedia.org/wiki/Moore's_law

25


A many-dimension problem ~ 7
Adapted from Sverre Jarp, OpenLab

• The 7 dimensions to speed increase
  • Traditional
    • Multiple computer nodes – embarrassingly parallel
    • Multi-core – one program uses as many cores as available
  • Less traditional
    • Multi-sockets – provide more hardware parallelism (but hard to program due to NUMA, Non-Uniform Memory Access)
    • Pipelining – instruction pipelining
    • Superscalar (MIMD: Multiple Instructions, Multiple Data)
    • Vector widths/SIMD (Single Instruction, Multiple Data)
    • 4th is a pseudo-dimension – hardware multi-threading
• Perhaps even more dimensions
  • Precision – quadruple precision and beyond …

26


Are we in good shape?
• Multi-core alone: community dominated by single-threaded applications and libraries
  • Algorithms: Kalman, …
  • Geant, ROOT, …
• Embarrassingly // is short-sighted (at best)
  • Internal bandwidth and storage speed
    • Memory would need to scale – no cost saving AND IO through the bus is a killer
    • Random IO to the underlying device
  • Some mitigation (laptop area)
    • More complex cache architectures – ReadyBoost, …
    • Solid State Drives, flash memory, hybrid drives
  • IO is also external: network, database, … standard schema challenged
• Fully parallel is not easy
  • Same IO issues – copy of data in/out of the box itself from many sources
  • Memory delays – copy of data in/out of memory is costly

Estimated – HENP harvesting/exploitation of 10-20% of the power at best

27


What are the show stoppers & difficulties?
• Many APIs and approaches
  • Vector Class helps – this comes for “free” and should be used (SIMD)
  • Old fork() or POSIX threads methods
  • OpenMP, Threading Building Blocks (TBB), … GPU – CUDA, … OpenCL, … Intel Cilk
• Not all methods are compatible with each other
• Thread synchronization issues
• So many dimensions, so many things to try – today, this typically requires deep knowledge of the architecture

[Figure: GPU block diagram – instruction cache, schedulers, dispatch units, register file, cores, load/store units x 16, special function units x 4, interconnect network, 64K configurable cache/shared memory, uniform cache]

Tilera Tile-GX, 100 cores

Nvidia / GPU: Teraflops per card, Fermi → Kepler → Maxwell

Intel Single-chip Cloud Computer, 48 cores in 24 tiles x 2 IA cores per tile – 24-router communication

http://en.wikipedia.org/wiki/Application_programming_interface

28


The “sequentiality”

Amdahl's law (1967)
• Overall speed-up is driven by the sequential part
• For N processors and a fully parallel portion P → S = 1/((1-P)+P/N)
• Examples:
  • 0% is // (P=0), speed-up is 1
  • 50% is // (P=0.5), Smax → 2
  • 100% is // (P=1), speed-up is N

With 5% sequential (easy with IO, initialization, input reading)
• 16 cores, S ~ 9 – 9/16 cores used at most, or ~57% of the machine power
• 32 cores, S ~ 12.5 – 12.5/32 cores, so now a drop to ~39%
• … by 512 cores, efficiency is 19/512 ~ 4% of optimal usage

Entire processing is driven by the slowest portion – corollary: focus on the part with the biggest gain (make the most common case fast), segregate the rest
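The formula on this slide is easy to evaluate directly. A minimal sketch in Python, assuming only the expression S = 1/((1-P)+P/N) quoted above; the function names `amdahl_speedup` and `efficiency` are illustrative and not taken from the talk:

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speed-up S = 1 / ((1 - P) + P / N) for a parallel fraction P on N cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


def efficiency(parallel_fraction: float, n_cores: int) -> float:
    """Fraction of the machine actually exploited: S / N."""
    return amdahl_speedup(parallel_fraction, n_cores) / n_cores


if __name__ == "__main__":
    # 5% sequential code means P = 0.95 (the case discussed on the slide).
    for n in (16, 32, 512):
        s = amdahl_speedup(0.95, n)
        print(f"{n:4d} cores: S = {s:6.2f}, efficiency = {100 * efficiency(0.95, n):5.1f}%")
```

Running the loop shows the speed-up saturating well below 1/(1-P) = 20 while the efficiency falls to a few percent by 512 cores, which is the point the slide makes.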

http://en.wikipedia.org/wiki/Gene_Amdahl

http://en.wikipedia.org/wiki/Amdahl's_law

29


Some examples of work, attempts and lessons learned

30


Problem #1: IO issues not only local
Access to a standard database service in STAR

• Scale
  • STAR: 0.5 M database queries per second (qps) at peak; Facebook (also MySQL) is doing 13 M qps. STAR is ~500 collaborators (1000 qps / user), Facebook is ~500 million users (0.026 qps / user)
  • Twitter users produce ~600 tweets per second (56 M users); STAR is doing ~600 FileCatalog requests per second (do the math of queries per user)
• The problem:
  • Data is acquired online – each event is sorted and recorded in a “file” (but not necessarily in time order)
  • Each file = one job; each job = one or more accesses to the database; each job = one timeline in the database
• Now, add to this:
  • Farm grows to 4000 CPUs (16 cores per box)
  • Some event selections become “rare” – ‘File D’ records events once every 15-20 minutes (while information is @ 1 Hz)
• Result: immediate database thrashing
  • Database efficiency relies on cache re-use
  • “Sparsity” destroys the philosophy – standard approach challenged

Would still happen, but MUCH later, if the program was fully parallel – x16 factor
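The "do the math" comparison in the Scale bullets above can be spelled out. A small sketch using only the figures quoted on the slide; the helper name `per_user_rate` and the dictionary layout are illustrative:

```python
def per_user_rate(total_rate_per_second: float, n_users: float) -> float:
    """Average request rate per user, in requests per second."""
    return total_rate_per_second / n_users


# Figures as quoted on the slide (peak rates, approximate user counts).
rates = {
    "STAR database queries": per_user_rate(0.5e6, 500),
    "Facebook database queries": per_user_rate(13e6, 500e6),
    "Twitter tweets": per_user_rate(600, 56e6),
    "STAR FileCatalog requests": per_user_rate(600, 500),
}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name:28s} {rate:12.6f} per second per user")
```

With these inputs a STAR collaborator generates several orders of magnitude more requests per second than a Facebook or Twitter user, which is why a "standard" database setup is stressed even by a small collaboration.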

31


Problem #1 (standard solutions)

Solutions:
(a) use larger cache [more memory]
(b) use faster cache [tune DB to have all indexes fit in memory, use SSD]
(c) re-order processing in strict time order
(d) split database service, enhance API and “fit all in memory” [db snapshot] or perform horizontal partitioning [see blog, CHEP 2010 contrib TBP]

Longer term: pre-fetching, client caching, WS, … or maximum DB or no DB
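Why strict time ordering helps cache re-use can be illustrated with a toy model. The sketch below is not the STAR database layer: it assumes a plain LRU cache keyed by a hypothetical "time slot" of calibration data, and compares requests arriving in time order with the same requests arriving in scattered order. All names and sizes are invented for the illustration.

```python
import random
from collections import OrderedDict


def hit_rate(requests, cache_size):
    """Replay a request stream through an LRU cache and return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / len(requests)


def make_requests(n_slots=1000, per_slot=20, seed=1):
    """Same set of requests twice: once in time order, once shuffled (sparse access)."""
    ordered = [slot for slot in range(n_slots) for _ in range(per_slot)]
    scattered = ordered[:]
    random.Random(seed).shuffle(scattered)
    return ordered, scattered


if __name__ == "__main__":
    ordered, scattered = make_requests()
    print("time-ordered hit rate:", hit_rate(ordered, cache_size=50))
    print("scattered hit rate:   ", hit_rate(scattered, cache_size=50))
```

In this model the time-ordered stream misses only on the first request of each slot, while the scattered stream mostly misses because the cache holds a small fraction of the slots; that is the thrashing behaviour solutions (a) to (c) try to avoid.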

http://drupal.star.bnl.gov/STAR/blog/jeromel/2010/aug/04/stream-effect-database-summer-2010

32


Problem #2: memory copy overhead
Problem #3: usability of API (diversity)
• LHC/ALICE using CUDA and OpenCL for speeding up a RAND function
  • Test on an NVIDIA GeForce GTX 480 – 2D sample, N=10^6, 4 Rand / point
  • Evaluation of a combined random number generator: use of a Tausworthe with a Linear Congruential Algorithm for a 2^121 periodicity
  • Each algorithm (Tx3, LCAx1) on one CPU / GPU and internal calculation split
  • Overall gain x10 (should be x1000), mainly due to moving data in/out of the bus
• Other work …
  • CERN/OpenLab – porting RooFit (likelihood fit) to CPU and GPU devices
  • OpenMP and OpenCL (hybrid) can coexist – OpenMP easier
  • CUDA preferred for optimal efficiency (vendor lock-in)
  • Performance obtained with cache-blocking techniques
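For readers unfamiliar with the combined generator mentioned above (three Tausworthe components plus one linear congruential step, period around 2^121), here is a CPU-only Python sketch. It follows the widely published "hybrid Tausworthe" construction from GPU Gems 3 (Howes and Thomas); the shift constants, masks and LCG coefficients are taken from that reference, and it is an assumption that the ALICE test used the same ones. The class name `HybridTaus` is illustrative.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic in Python


def taus_step(z, s1, s2, s3, m):
    """One step of a Tausworthe component on a 32-bit state."""
    b = ((((z << s1) & MASK32) ^ z) >> s2)
    return (((z & m) << s3) & MASK32) ^ b


def lcg_step(z, a=1664525, c=1013904223):
    """One step of a 32-bit linear congruential generator."""
    return (a * z + c) & MASK32


class HybridTaus:
    """Three Tausworthe steps XORed with one LCG step (Tx3, LCAx1)."""

    def __init__(self, z1, z2, z3, z4):
        # The reference recommends Tausworthe seeds larger than 128.
        self.state = [z1 & MASK32, z2 & MASK32, z3 & MASK32, z4 & MASK32]

    def next_uint32(self):
        z1, z2, z3, z4 = self.state
        z1 = taus_step(z1, 13, 19, 12, 0xFFFFFFFE)
        z2 = taus_step(z2, 2, 25, 4, 0xFFFFFFF8)
        z3 = taus_step(z3, 3, 11, 17, 0xFFFFFFF0)
        z4 = lcg_step(z4)
        self.state = [z1, z2, z3, z4]
        return z1 ^ z2 ^ z3 ^ z4

    def random(self):
        """Uniform float in [0, 1)."""
        return self.next_uint32() / 4294967296.0


if __name__ == "__main__":
    gen = HybridTaus(12345, 67890, 13579, 24680)
    print([round(gen.random(), 6) for _ in range(4)])
```

Each of the four components is independent of the others, which is what makes the "one algorithm per CPU / GPU" split described on the slide possible; the cost of moving the state and results across the bus is not modelled here.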

33


Problem #4: a combo of sequentiality and little saving
Problem #5: Language roadblock

• LHC experiment “A” tries event parallelization
  • Events processed in separate sub-processes (essentially fork()), separate IO
  • Merger at the end
• Findings
  • Saves memory (copy on write / memory shared until data is altered) - in fact, ~ 1.2-1.4 GB / event shared …
  • Does NOT save time compared to embarrassingly // (in fact, less efficient by a few %)
  • fork() => nearly no code rewrite (but no real gain apart from memory)
• Recurrent theme in other methods => usually, revert from STL and OO models to plain C structs to become compatible with a wide range of APIs or to reach minimum efficiency – code re-write

Stepping away from C++ is a recurrent theme
Is C++ a show stopper to parallel programming?
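The fork-and-merge pattern described for experiment “A” can be sketched in a few lines. This is a POSIX-only illustration, not the experiment's framework: `process_event`, `run_forked` and the toy calibration list are hypothetical stand-ins, with the list playing the role of the large read-only data shared copy-on-write between parent and children.

```python
import os
import pickle


def process_event(calibration, event):
    """Hypothetical per-event work that only reads the shared calibration data."""
    return calibration[event % len(calibration)] * event


def run_forked(events, n_workers, calibration):
    """Fork n_workers children, each processing a slice of events; merge at the end."""
    children = []
    for worker in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:                      # child: inherits calibration copy-on-write
            status = 1
            try:
                os.close(read_fd)
                partial = [process_event(calibration, e)
                           for e in events[worker::n_workers]]
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(partial, out)   # separate output per sub-process
                status = 0
            finally:
                os._exit(status)          # never fall through into parent code
        os.close(write_fd)
        children.append((pid, read_fd))

    merged = []                           # the "merger at the end"
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as inp:
            merged.extend(pickle.load(inp))
        os.waitpid(pid, 0)
    return merged


if __name__ == "__main__":
    calib = [1, 2, 3]
    print(sum(run_forked(list(range(100)), 4, calib)))
```

As on the slide, the existing per-event code needs almost no rewrite, but each child is still a sequential process, so the gain is in memory rather than in time. In CPython the memory sharing is weaker than in C++ because reference counting writes to otherwise read-only pages.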

34


Dreams … and holy grail

35


Wouldn’t it be nice if …
• A product / API would appear answering our prayers for “a” standard
  • Should provide forward evolution
• Many tries – none crystallized: CT, OpenMP, …
• OpenCL may be “a” way
  • Embraced by Khronos group
  • Embraced by Mac
  • Embraced by ATI and AMD, …

http://www.khronos.org/opencl/

36


Language?…
• C/C++ “locking + threads” is possible but hard to get right, and even harder to keep right, efficient and portable
  • C++ OO models have lots of overhead, templates are impossible to handle efficiently – often best to step back to C
  • Prediction: a language or compiler/wrapper war is coming soon
  • Much investment in C++ in HENP, so the move will be slow
• Needs
  • A new language allowing for friendly parallelism
  • OR a compiler that would allow auto-parallelization …
• Can Go save us?
  • Addresses non-multi-core issues from yesterday and some multi-core issues of tomorrow (type-safe data communication & sync, garbage collection, reflection, …)
  • gcc-go coming – large exposure to be expected
  • BUT binding to C++ is hardly possible, no dynamic library build and no dynamic library loading, no operator overloads, …
  • Prediction: it may evolve or may not be “the” language – but the attempt is interesting and we will learn from new concepts

http://golang.org/

37


Even more trends …

38


Some new trends …
Sverre Jarp, CERN/OpenLab

• Phones
  • Soon, there will be one for every inhabitant on earth: 1,650,000,000 expected to be sold this year
  • Smartphones: fast approaching 1 billion devices: 480,000,000 this year; compound annual growth rate (CAGR): 60%
  • Most (if not all) phones shipped in 2015 will be smartphones
• Tablets: 50,000,000, CAGR of 200%
• In comparison:
  • Netbooks/Notebooks (200,000,000)
  • Desktops (150,000,000)
  • Servers (10,000,000) with 55 BUSD in revenue

• Why this may be important
  • Sheer number of smartphones is “tempting” … a little calculation there = big win
  • Tablets may drive “finger over mouse & keyboard” developments – Windows 8 & the next release of ROOT is iOS based (simplified GUI)

• Other buzz
  • 300 Chinese fabless companies are springing up across the country
  • …

The lesson: many changes may be led by the private sector, booming companies, new devices … Innovation → demand → paradigm shift

39


Do we have a clear path forward?
Concluding remarks

40


Cloud computing
• The path is rather clear
  • Forget about clusters and batch systems
  • Mega-machines will be apportioned into virtual clusters & resources
  • Elastic computing – expand on demand when you need it
  • Outsourcing is here to stay
• New features will appear
  • Control from one laptop (or even a smartphone)
  • Pay as you go
  • Market place - highest bidder

http://www.thetechlabs.com/latest/cloud-computing/

41


Many core?
• Clear path? Yes, as much as Mr Magoo
  • I am sure we will be saved at the last minute by a suddenly appearing elevator (i.e. solutions) … But we are not there yet
  • For now, the community is “lost” or “searching” …

• For now
  • A multi-dimension / multi-disciplinary problem
  • Problem has a logistic / strategic design dimension: “parallelize all or …”? Workflow split may be inevitable – some on many-core, some on cloud
  • IO needs to be delegated, delayed, asynchronous, buffered, …

• What needs to happen
  • Experiments need to build teams with very broad knowledge
    • We cannot concentrate on just one layer, ignoring the others
    • I view NO path to success without close collaboration of CS and physicists
  • APIs and libraries must be agile and flexible
    • Common libraries, shared algorithms, …
  • We likely need “a” new language or compiler savior – C++ WILL otherwise be challenged – a language war is coming
  • A common strategy and architecture – where are the software architects?

42


Overall

• A computing landscape with potential for dramatic paradigm shifts (de-localization of resources, vastly parallel, tactile devices, … 3D?)
• A world rich in opportunities for young scientists
• Team work WILL be the way to success

See blue: UK faculty, staff, alumni and students - are united with a common purpose: to make the world a better place, a place where our students will lead lives of purpose and meaning
