What is « HPC » ? - CentraleSupelec | 2018-01-09


A short overview of High Performance Computing

Stéphane Vialle
Stephane.Vialle@centralesupelec.fr — http://www.metz.supelec.fr/~vialle
M2-ILC: Parallelism, distributed systems and grids

What is HPC ?

High Performance Computing sits at the crossroads of:
- High performance hardware
- Parallel programming: serial optimization, tiling, vectorization, multithreading, message passing
- Numerical algorithms
- Intensive computing & large-scale applications: Big Data, « HPDA »
- Data scales (TeraBytes, PetaBytes, ExaBytes) and compute scales (TeraFlops, PetaFlops, ExaFlops)

High Performance Hardware

From core to supercomputer — inside high performance hardware:
- Computing cores
- Multi-core processor
- Multi-core PC/node
- Multi-core PC cluster
- Supercomputer: + hardware accelerators, fat-tree interconnect

What are the problems?
- Programming
- Energy consumption
- Failures / MTBF

Top500 site: HPC roadmap. [Top500 performance-growth charts; the recent slope suggests a decline?]

Architecture issues

Why multi-core processors? Shared or distributed memory?

Processor performance increase has been due to parallelism since 2004 (W. Kirshenmann, EDF & EPI AlGorille, after a first study by SpiralGen). CPU performance gains have come from several (different) parallel mechanisms for years now — and they require explicit parallel programming.

Re-interpreted Moore's law (after Jack Dongarra): it became impossible to dissipate so much electrical power in the silicon, and to pay for so much energy. Bounding the frequency while increasing the number of cores is energy efficient: two cores at 75% frequency draw about 2 × 0.75³ ≈ 0.8 of the power of one full-speed core.

- Initial (electronic) Moore's law: every 18 months, ×2 transistors per unit area.
- Previous computer-science interpretation: every 18 months, ×2 processor speed.
- New computer-science interpretation: every 24 months, ×2 number of cores.

This leads to a massive parallelism challenge: splitting many codes into 100, 1000, ... up to 10⁶ or even 10⁷ threads!

3 classic parallel architectures (see "Overview of Recent Supercomputers", Aad J. van der Steen & Jack J. Dongarra):

1. Shared-memory machines (SMP, Symmetric MultiProcessor): processors share one memory through a network. One principle — several implementations, different costs, different speeds.
2. Distributed-memory machines (clusters): each processor has its own memory, and nodes communicate over an interconnection network (hypercubes, fat trees, Gigabit Ethernet...). The basic principles are simple, but cost and speed depend on the interconnection network. A highly scalable architecture.
3. Distributed Shared Memory machines (DSM): cache-coherent Non-Uniform Memory Architecture (ccNUMA) extends the cache mechanism across the network. Hardware implementations are fast & expensive; software implementations are slow & cheap! Up to 1024 nodes; supports global multithreading.

Summary:
- SMP: simple and efficient up to ~16 processors, but a limited solution.
- Cluster: unlimited scalability, but efficiency and price depend on the interconnect.
- DSM: a comfortable and efficient solution, with efficient hardware implementations up to ~1000 processors.

2016: almost all supercomputers have a cluster-like architecture.

Evolution of parallel architectures: single processor, SMP, pure SIMD, shared-memory vector machines, MPP (clusters with custom interconnects), clusters. [Top500 charts, 1993–2017: % of the 500 systems and % of the computing power in the Top500, per architecture class.] BUT...

Modern parallel architecture: a cluster of NUMA nodes (one PC = one node, each node itself a NUMA machine). Access times form a hierarchy:

time to access data in the RAM of the local processor
< time to access data in RAM attached to another processor (several access times exist even inside one NUMA node)
< time to send/receive data to/from another node.

With such hierarchical architectures, {data–thread} placement becomes critical.

Current standard HPC systems are hierarchical and hybrid architectures: clusters of multi-processor NUMA nodes with hardware vector accelerators. They call for multi-paradigm programming (or a new high-level paradigm):
- AVX vectorization: #pragma simd
- + CPU multithreading: #pragma omp parallel for
- + message passing: MPI_Send(..., Me-1, ...); MPI_Recv(..., Me+1, ...)
- + GPU vectorization: myKernel()
- + checkpointing

Fault Tolerance in HPC

Can we run a very large and long parallel computation, and succeed? Can a one-million-core parallel program run for one week?

Mean Time Between Failures (MTBF): the average time a system works without any failure.

The Cray-1 required extensive maintenance: initially, its MTBF was on the order of 50 hours, and two hours of every day were typically set aside for preventive maintenance (Cray-1: 1976).

"Today, 20% or more of the computing capacity in a large high-performance computing system is wasted due to failures and recoveries. Typical MTBF is from 8 hours to 15 days. As systems increase in size to field petascale computing capability and beyond, the MTBF will go lower and more capacity will be lost." — System Resilience at Extreme Scale, white paper prepared for Dr. William Harrod, Defense Advanced Research Projects Agency (DARPA). See also "Addressing Failures in Exascale Computing", a report produced by a workshop on the topic, 2012–2013.

Why do we need fault tolerance?
- Processor frequency is limited and the number of cores increases → we use more and more cores.
- We use more and more cores during the same time → the probability of failure increases!
- We do not attempt to speed up our applications; we process larger problems in constant time (Gustafson's law).
- We (really) need fault tolerance, or large parallel applications will never end!

Fault tolerance strategies — fault tolerance in HPC remains a hot topic:
- High Performance Computing (big computations, batch mode): checkpoint/restart is the usual solution — it complexifies the source code and consumes time and disk!
- High Throughput Computing (a flow of small, time-constrained, independent tasks): a task is re-run (entirely) when a failure happens.
- Big Data: data-storage redundancy, and computation on (frequently) incomplete data sets — a different approach!

Who needs fault tolerance? In an HPC cluster, computing resources are checked regularly; faulty resources are identified and not allocated, so many users never face frequent failures (good — parallel computers are not so bad!). Fault tolerance matters when running applications on large numbers of resources for long times: such runs need to restart from a recent checkpoint. Remark: critical parallel applications (with hard deadlines) need redundant resources and redundant runs — impossible on very large parallel runs.

Energy Consumption

1 PetaFlops: 2.3 MW! → 1 ExaFlops: 2.3 GW!! → 350 MW → ... 20 MW? Perhaps we will be able to build the machines, but not to pay for their energy consumption!

From Petaflops to Exaflops:
- 1.03 PetaFlops, June 2008: RoadRunner (IBM), Opteron + PowerXCell, 122,440 cores, 500 Gb/s (IO), 2.35 MW!
- 93.01 PetaFlops, June 2016: Sunway TaihuLight (NRCPC, China), Sunway SW26010 260C at 1.45 GHz, 10,649,600 cores, 15.4 MW.
- 1.00 ExaFlops, expected 2018–2020, then 2020–2022: 25 Tb/s (IO), 20–35 MW max.
That is ×1000 performance with ×100 cores/node, ×10 nodes, ×50 IO — and only ×10 energy. How to program these machines? How to train large developer teams?

Sunway TaihuLight (no. 1 in 2016): 93.0 PFlops, 41,000 Sunway SW26010 260C processors at 1.45 GHz, 10,649,600 cores, 15.4 MW. Sunway interconnect: a 5-level integrated hierarchy (InfiniBand-like?).

Cooling

Cooling is close to 30% of the energy consumption — optimization is mandatory! Cooling is strategic:
- Processors are becoming less power-hungry: each processor's consumption is limited, processors switch to an economy mode when idle, and the flops/watt ratio improves.
- But processor density is rising, with a trend to limit the total footprint of machines (floor area in m²).
→ Efficient and cheap (!) cooling is needed; it is often estimated at 30% of the energy bill. [Illustration: a board of an IBM Blue Gene.]

- Liquid and immersive cooling: boards are immersed in an electrically neutral, cooled liquid. Immersive liquid cooling was used on the CRAY-2 in 1985 and tested by SGI & Novec in 2014.
- Cold doors (air + water cooling): a water-cooled door/grid through which circulates the air flow that has just cooled the machine. Cooling is concentrated at the rack.
- Direct liquid cooling: cold water is brought directly to the hot spot, but stays isolated from the electronics. Experimental in 2009 (an IBM test board for the Blue Water project); adopted since (IBM, BULL, ...); an IBM compute blade was commercialized in 2012.
- Optimized air flow: air flows are optimized at the entry and exit of the racks. The Blue Gene architecture has a high processor density, aims at minimal floor footprint and minimal energy consumption, and adds triangular shapes to optimize the air flow.
- Extreme air cooling: ambient-temperature air circulating at high speed and in high volume. The CPUs run close to their maximum tolerable temperature (e.g., 35°C on a motherboard without problems), and the air flow itself is not cooled. Economical! But the machine must stop when the ambient air is too warm (in summer)! [Example: a Grid5000 machine in Grenoble — the only one using extreme cooling.]

An interesting history of the CRAY company

- Cray-1, 1976: 133 MFlops. Cray-2, 1985: 1.9 GFlops. Cray-YMP, 1988. Cray-C90, 1991: 16 GFlops. Cray-J90. Cray-T90: 60 GFlops.
- Then Cray-SV1 (1 TFlop), Cray-SV2 and Cray-SX-6 (an import agreement with NEC for its SX machines). Cray was dismantled and seemed to have disappeared. Then, in 2002, an event occurred...
- The Earth Simulator appeared: a big NEC vector cluster of 640 nodes with 8 processors each — 5120 processors, 40 TFlops peak, reaching 35 TFlops in June 2002 (a vector MPP). Vector machines were back at the 1st place of the Top500 (in 2002)!
- Strong concern in the USA! CRAY was there again, with great ambitions: the announced Cray-X1, a 52.4 TFlops vector MPP.
- Then came the Cray-XT3, XT4 and XT5 (clusters of multi-core CPUs under Linux), the hybrid Cray-XT5h (a cluster of CPU/vector/FPGA nodes under Unicos, the Cray Unix), and the Cray-XT6 or XT6h (?) (6-core Opterons, 2D torus, Cray network).
- Cray XT6 ("Jaguar"): 1st in the Top500 in November 2009, with 1.7 PFlops for 6.9 MW. Architecture: proprietary interconnect + 6-core Opterons — a traditional and very energy-hungry architecture, but very efficient and under Linux (software available). Cray was back at the 1st place in November 2009, with Opterons.
- Cray XK7: accelerators enter the CRAY range. 1st in the Top500 in November 2012, with 17.6 PFlops for 8.2 MW. Architecture: proprietary interconnect; each node couples a 16-core Opteron with an NVIDIA Tesla K20 GPU; 18,688 nodes → 299,008 CPU cores + 18,688 K20 GPUs. Cray was at the 1st place in November 2012 with Opteron + GPU — and still at the 3rd place in June 2016.

A short overview of High Performance Computing — Questions?

