Factorization-based Sparse Solvers and Preconditioners
Xiaoye Sherry Li
Lawrence Berkeley National Laboratory, USA
xsli@lbl.gov

crd-legacy.lbl.gov/~xiaoye/G2S3/

4th Gene Golub SIAM Summer School, 7/22 – 8/7, 2013, Shanghai

Acknowledgement

- Jim Demmel, UC Berkeley, course on Applications of Parallel Computers: http://www.cs.berkeley.edu/~demmel/cs267_Spr13/
- John Gilbert, UC Santa Barbara, course on Sparse Matrix Algorithms: http://cs.ucsb.edu/~gilbert/cs219/cs219Spr2013/
- Patrick Amestoy, Alfredo Buttari, ENSEEIHT, course on Sparse Linear Algebra
- Jean-Yves L'Excellent, Bora Uçar, ENS-Lyon, course on High-Performance Matrix Computations
- Artem Napov, Univ. of Brussels
- Francois-Henry Rouet, LBNL
- Meiyue Shao, EPFL
- Sam Williams, LBNL
- Jianlin Xia, Shen Wang, Purdue Univ.

Course outline

1. Fundamentals of high performance computing
2. Basics of sparse matrix computation: data structure, graphs, matrix-vector multiplication
3. Combinatorial algorithms in sparse factorization: ordering, pivoting, symbolic factorization
4. Numerical factorization & triangular solution: data-flow organization
5. Parallel factorization & triangular solution
6. Preconditioning: incomplete factorization
7. Preconditioning: low-rank data-sparse factorization
8. Hybrid methods: domain decomposition, substructuring method

Course materials online: crd-legacy.lbl.gov/~xiaoye/G2S3/

Lecture 1. Fundamentals: Parallel computing, Sparse matrices
Xiaoye Sherry Li
Lawrence Berkeley National Laboratory, USA
xsli@lbl.gov

4th Gene Golub SIAM Summer School, 7/22 – 8/7, 2013, Shanghai

Lecture outline

- Parallel machines and programming models
- Principles of parallel computing performance
- Design of parallel algorithms
  - Matrix computations: dense & sparse
  - Partial Differential Equations (PDEs): mesh methods
  - Particle methods
  - Quantum Monte-Carlo methods
  - Load balancing, synchronization techniques

Parallel machines & programming models (hardware & software)

Idealized Uniprocessor Model

- Processor names bytes, words, etc. in its address space; these represent integers, floats, pointers, arrays, etc.
- Operations include:
  - Read and write into very fast memory called registers
  - Arithmetic and other logical operations on registers
- Order specified by program:
  - Read returns the most recently written data
  - Compiler and architecture translate high-level expressions into obvious lower-level instructions (assembly)

- Hardware executes instructions in the order specified by the compiler

Idealized Cost

- Each operation has roughly the same cost (read, write, add, multiply, etc.). Example: A = B + C
  - Read address(B) to R1
  - Read address(C) to R2
  - R3 = R1 + R2
  - Write R3 to address(A)

[CS267 Lecture 2]

Uniprocessors in the Real World

Real processors have:
- Registers and caches
  - Small amounts of fast memory
  - Store values of recently used or nearby data
  - Different memory ops can have very different costs
- Parallelism
  - Multiple functional units that can run in parallel
  - Different orders and instruction mixes have different costs
- Pipelining
  - A form of parallelism, like an assembly line in a factory

Why do we need to know this? In theory, compilers and hardware understand all this and can optimize your program; in practice they don't. They won't know about a different algorithm that might be a much better match to the processor.

(Speaker note: In the old days, computer architectures were much simpler, and the compiler could do a very good job.)

Parallelism within a single processor: pipelining

- Like an assembly line in manufacturing
- An instruction pipeline allows overlapping execution of multiple instructions with the same circuitry

- Sequential execution: 5 (cycles) * 5 (instructions) = 25 cycles
- Pipelined execution: 5 (cycles to fill the pipe, latency) + 5 (cycles, at 1 cycle/instruction throughput) = 10 cycles
- Arithmetic unit pipeline: an FP multiply may have a latency of 10 cycles, but a throughput of 1/cycle
- Pipelining helps throughput/bandwidth, but not latency
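The cycle counts above can be sketched with a toy model; this follows the slide's simple accounting (pay the fill latency once, then retire one instruction per cycle), and the function names are ours, not from the course materials.

```python
def sequential_cycles(stages, n_instructions):
    # Without pipelining, each instruction occupies the whole datapath
    # for `stages` cycles before the next one starts.
    return stages * n_instructions

def pipelined_cycles(stages, n_instructions):
    # Slide's model: `stages` cycles to fill the pipe (latency), then
    # one instruction completes per cycle (throughput).
    return stages + n_instructions

print(sequential_cycles(5, 5))  # 25 cycles
print(pipelined_cycles(5, 5))   # 10 cycles
```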


Pipeline stages:
- IF = Instruction Fetch
- ID = Instruction Decode
- EX = Execute
- MEM = Memory access
- WB = Register write back

Parallelism is everywhere; it is hard to find a machine without parallelism.

Parallelism within a single processor: SIMD

SIMD: Single Instruction, Multiple Data

[Figure: scalar add produces one result, X + Y; an SSE SIMD add on 4-element vectors produces x0+y0, x1+y1, x2+y2, x3+y3 in one operation. Slide source: Alex Klimovitski & Dean Macri, Intel Corporation]

- Scalar processing: traditional mode; one operation produces one result
- SIMD processing with SSE / SSE2 (SSE = streaming SIMD extensions): one operation produces multiple results in a 128-bit vector register

SSE / SSE2 SIMD on Intel

- SSE2 data types: anything that fits into 16 bytes, e.g., 16x bytes, 4x floats, 2x doubles
- Instructions perform add, multiply, etc. on all the data in this 16-byte register in parallel
- Challenges:
  - Data needs to be contiguous in memory and aligned
  - Some instructions exist to move data around from one part of the register to another
- Similar on GPUs and vector processors (but with many more simultaneous operations)

(Speaker note: One would hope the compiler would recognize this, but it is not always possible. You need to build your data structures appropriately, e.g. aligned on cache-line boundaries, i.e. the first word has enough trailing zeros in its address.)

Variety of node architectures
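To make the 16-byte-register picture concrete, here is a small sketch (ours, not from the slides) that packs the same 128 bits as 4 single-precision floats, 2 doubles, or 16 bytes using Python's struct module, and then performs the element-wise "SIMD add" on one such register:

```python
import struct

# One 128-bit (16-byte) SSE register can be viewed several ways:
as_floats = struct.pack('<4f', 1.0, 2.0, 3.0, 4.0)   # 4x 32-bit floats
as_doubles = struct.pack('<2d', 1.0, 2.0)            # 2x 64-bit doubles
as_bytes = struct.pack('<16b', *range(16))           # 16x 8-bit integers

assert len(as_floats) == len(as_doubles) == len(as_bytes) == 16

# A SIMD add means element-wise addition over the whole register:
x = struct.unpack('<4f', as_floats)
y = (10.0, 20.0, 30.0, 40.0)
print([xi + yi for xi, yi in zip(x, y)])  # [11.0, 22.0, 33.0, 44.0]
```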


- Cray XE6: dual-socket x 2-die x 6-core, 24 cores
- Cray XC30: dual-socket x 8-core, 16 cores
- Cray XK7: 16-core AMD + K20X GPU
- Intel MIC: 16-core host + 60+ core co-processor
- (Here socket == processor)

TOP500 (www.top500.org)

- Listing of the 500 fastest computers
- Metric: LINPACK benchmark. "How fast is your computer?" = "How fast can you solve a dense linear system Ax = b?"
- Current records (June, 2013):

Rank  Machine                                                       Cores      Linpack (Pflop/s)  Peak (Pflop/s)
1     Tianhe-2, Intel MIC (China National Univ. of Defense Tech.)   3,120,000  33.8 (61%)         54.9
2     Titan, Cray XK7 (US Oak Ridge National Lab)                   560,640    17.6 (65%)         27.1
3     Sequoia, BlueGene/Q (US Lawrence Livermore National Lab)      1,572,864  17.1 (85%)         20.1

(Speaker note: internationally recognized ranking site.)

Units of measure in HPC

High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flop/s: floating point operations per second
- Bytes: size of data (a double-precision floating point number is 8 bytes)

Typical sizes are millions, billions, trillions:
- Mega:  Mflop/s = 10^6 flop/sec;  Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
- Giga:  Gflop/s = 10^9 flop/sec;  Gbyte = 2^30 ~ 10^9 bytes
- Tera:  Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
- Peta:  Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
- Exa:   Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
- Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
- Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes

Memory Hierarchy: flop/s is not everything

Most programs have a high degree of locality in their accesses:
- Spatial locality: accessing things nearby previous accesses
- Temporal locality: reusing an item that was previously accessed

The memory hierarchy tries to exploit locality to improve the average access cost:

Level                                                Speed    Size
Processor (control, datapath, registers, on-chip cache)  ~1 ns    KB
Second-level cache (SRAM)                            ~10 ns   MB
Main memory (DRAM)                                   ~100 ns  GB
Secondary storage (disk)                             ~10 ms   TB
Tertiary storage (disk/tape)                         ~10 sec  PB

Hopper Node Topology: Understanding NUMA Effects [J. Shalf]
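The decimal-vs-binary prefix gap in the units table can be sketched in a few lines; this small helper is ours, not part of the course materials.

```python
# HPC rates (flop/s) use decimal prefixes; memory sizes conventionally
# use powers of two (10 bits per prefix step).
PREFIXES = {'M': 6, 'G': 9, 'T': 12, 'P': 15, 'E': 18}

def flops(prefix):
    # e.g. Gflop/s = 10^9 floating point operations per second
    return 10 ** PREFIXES[prefix]

def bytes_binary(prefix):
    # Mbyte = 2^20, Gbyte = 2^30, Tbyte = 2^40, ...
    return 2 ** (10 * (PREFIXES[prefix] // 3))

print(flops('G'))          # 1000000000
print(bytes_binary('M'))   # 1048576
print(bytes_binary('G') / flops('G'))  # slightly above 1: 2^30 > 10^9
```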

(Speaker note: say thick = full link, thin = lesser link. It is complicated; in order to use the node effectively, does it matter?)

Arithmetic Intensity

Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes. E.g., dense matrix-matrix multiplication: n^3 flops / n^2 memory.
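As a sketch of the AI estimate above (our code; it uses the conventional 2n^3 flop count and a best-case DRAM traffic of one pass over the three matrices, rather than the slide's order-of-magnitude n^3 / n^2), dense matmul's AI grows with n while sparse matrix-vector multiply stays constant:

```python
def ai_dense_matmul(n, word_bytes=8):
    # C = A*B: ~2n^3 flops; at best the three n-by-n matrices move
    # through DRAM once, i.e. 3n^2 words.
    flops = 2 * n**3
    dram_bytes = 3 * n**2 * word_bytes
    return flops / dram_bytes

def ai_spmv(nnz, word_bytes=8):
    # y = A*x: ~2*nnz flops; the matrix values alone stream nnz words
    # from DRAM (index traffic ignored here), so AI is O(1).
    return (2 * nnz) / (nnz * word_bytes)

print(ai_dense_matmul(100))  # ~8.3 flops/byte, grows linearly with n
print(ai_spmv(5000))         # 0.25 flops/byte regardless of size
```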

Higher AI means better locality, which is amenable to many optimizations and can achieve a higher % of machine peak.

[Figure: arithmetic intensity spectrum [S. Williams], ranging from O(1) (SpMV, BLAS 1/2, stencils (PDEs), lattice methods, PIC codes) through O(log N) (FFTs) to O(N) (dense linear algebra / BLAS3, naive particle methods).]

Roofline model (S. Williams): basic concept

- Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound-and-bottleneck analysis
- Assume the FP kernel is maintained in DRAM, and perfectly overlap computation and communication with DRAM
- Arithmetic Intensity (AI) is computed based on DRAM traffic after being filtered by the cache
- Question: is the code computation-bound or memory-bound?

Time is the maximum of the time required to transfer the data and the time required to perform the floating point operations:

    time = max( Bytes / STREAM bandwidth , Flops / peak Flop/s )

Roofline model: simple bound

- Given the code's AI, one can inspect the roofline figure to bound performance
- Provides insight into which optimizations will potentially be beneficial

The bound is machine-dependent and code-dependent:

    Attainable Performance_ij = min( FLOP/s (with optimizations 1..i), AI * Bandwidth (with optimizations 1..j) )

[Figure: roofline for the Opteron 2356 (Barcelona): attainable GFLOP/s vs. actual FLOP:byte ratio, bounded by peak DP and Stream Bandwidth.]

Example
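The min() bound above can be coded directly; the peak and bandwidth numbers below are hypothetical placeholders, not measured Opteron 2356 figures.

```python
def attainable_gflops(ai, peak_gflops, stream_bw_gbs):
    # Roofline bound: performance is capped either by the compute peak
    # or by how fast DRAM can feed the kernel (AI * bandwidth).
    return min(peak_gflops, ai * stream_bw_gbs)

PEAK = 74.0  # hypothetical double-precision peak, GFLOP/s
BW = 16.0    # hypothetical STREAM bandwidth, GB/s

print(attainable_gflops(0.25, PEAK, BW))  # 4.0  -> memory-bound
print(attainable_gflops(16.0, PEAK, BW))  # 74.0 -> compute-bound
```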

Consider the Opteron 2356:
- Dual socket (NUMA)
- Limited HW stream prefetchers
- Quad-core (8 cores total)
- 2.3 GHz
- 2-way SIMD (DP)
- Separate FPMUL and FPADD datapaths
- 4-cycle FP latency

Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like?

Roofline Model: basic concept

Naively, one might assume peak performance is always attainable.

[Figure: attainable GFLOP/s for the Opteron 2356 (Barcelona), flat at peak DP.]

However, with a lack of locality, DRAM bandwidth can be a bottleneck.

- Plot on a log-log scale
- Given AI, we can easily bound performance
- But architectures are much more complicated

We will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: roofline for the Opteron 2356 (Barcelona): attainable GFLOP/s vs. actual FLOP:byte ratio, bounded by peak DP and Stream Bandwidth.]

Roofline Model: computational ceilings

- Opterons have dedicated multipliers and adders
- If the code is dominated by adds, then attainable performance is half of peak
- We call these "ceilings"; they act like constraints on performance
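Adding a ceiling to the roofline bound is a one-line change; this sketch is ours, and again the peak and bandwidth numbers are hypothetical, not measured Opteron 2356 figures.

```python
def attainable_with_ceiling(ai, peak_gflops, stream_bw_gbs, ceiling_frac=1.0):
    # A computational ceiling (e.g. an add-dominated kernel that cannot
    # keep the separate multiplier busy) lowers the flat part of the roof.
    compute_roof = peak_gflops * ceiling_frac
    return min(compute_roof, ai * stream_bw_gbs)

PEAK = 74.0  # hypothetical peak DP, GFLOP/s
BW = 16.0    # hypothetical STREAM bandwidth, GB/s

# Add-dominated code cannot co-issue multiplies: half of peak.
print(attainable_with_ceiling(16.0, PEAK, BW, ceiling_frac=0.5))  # 37.0
# At low AI the ceiling is irrelevant: bandwidth binds first.
print(attainable_with_ceiling(0.5, PEAK, BW, ceiling_frac=0.5))   # 8.0
```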