Vivek Sarkar
Department of Computer Science, Rice University
August 27, 2007
COMP 635: Seminar on Heterogeneous Processors
www.cs.rice.edu/~vsarkar/comp635
COMP 635, Fall 2007 (V.Sarkar)
Course Goals
• Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum
• Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)
• Discuss research challenges in advancing the state of the art of software for heterogeneous processors
• Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or understanding strengths and weaknesses of heterogeneous processors w.r.t. their research areas
Course Organization
• Class dates (12 lectures)
— 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
— No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007 conference week)
— No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
• Time & Place
— Default: Mondays, 3:30pm - 4:30pm, DH 2014
— Exception: time & place for 9/20 (Thurs) lecture TBD
— 30 minutes reserved after lecture for discussion (optional)
• Office Hours (DH 3131)
— 11am - 12noon, Fridays from 8/31/07 to 12/7/07
• OWL-Space repository: COMP 635 F07
• Grading
— Satisfactory/unsatisfactory grade for students taking seminar for credit
– Others should register officially as auditors, if possible
— For a satisfactory grade, you need to
1. Attend at least 50% of lectures
2. Submit a 4-page project/study report by 12/7/07 (report can be prepared in a group - just plan on 4 pages/person in that case)
— Optional in-class presentation of project/study report on 12/3/07
Course Content
• Introduction to Heterogeneous Processors and their Programming Models (1 lecture)
• Cell Processor and Cell SDK (2 lectures)
• Nvidia GPU and CUDA programming environment (2 lectures)
• DRC FPGA Coprocessor Module and Celoxica Programming Environment (1 lecture)
• Clearspeed Accelerator and SDK (1 lecture)
• Imagine Stream Processor (1 lecture)
• Microsoft Accelerator Library (1 lecture)
• Vector and SIMD processors -- a historical perspective (1 lecture)
• Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)
• Student presentations (1 lecture)
COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models
Acknowledgments
• Georgia Tech ECE 6100, Module 14
— Vince Mooney, Krishna Palem, Sudhakar Yalamanchili
— http://www.ece.gatech.edu/academic/courses/fall2006/ece6100/Class/index.html
• MIT 6.189 IAP 2007, Lecture 2
— “Introduction to the Cell Processor”, Michael Perrone
— http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• UIUC ECE 497, Lecture 16
— courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt
• UIUC ECE 498 AL1, Programming Massively Parallel Processors
— David Kirk, Wen-mei Hwu
— http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html
Heterogeneous Processors
[Figure: block diagram with a general-purpose processor (GPP), a memory transfer module (MTM), main memory, and accelerators (ACC) with local memories]
Memory transfer module schedules system-wide bulk data movement
General-purpose processor orchestrates activity
Accelerators can use scheduled, streaming communication…
or can operate on locally-buffered data pushed to them in advance
Accelerated activities and associated private data are localized for bandwidth, power, efficiency
Motivation:
1) Different parts of programs have different requirements
   Control-intensive portions need good branch predictors, speculation, big caches to achieve good performance
   Data-processing portions need lots of ALUs, have simpler control flows
2) Power consumption
   Features like branch prediction, out-of-order execution, tend to have very high power/performance ratios.
   Applications often have time-varying performance requirements
Sample Application Domains for Heterogeneous Processors
• Cell Processor
— Medical imaging, Drug discovery, Reservoir modeling, Seismic analysis, …
• GPU (e.g., Nvidia)
— Computer-aided design (CAD), Digital content creation (DCC), emerging HPC applications, …
• FPGA (e.g., Xilinx DRC)
— HPC, Petroleum, Financial, …
• HPC accelerators (e.g., Clearspeed)
— HPC, Network processing, Graphics, …
• Stream Processors (e.g., Imagine)
— Image processing, Signal processing, Video, Graphics, …
• Others
— TCP/IP offload, Crypto, …
Programming Models for Heterogeneous Processors
• Data Parallelism
• Single Program Multiple Data (SPMD)
• Pipelining
• Work Queue
• Fork Join
• Message Passing
• Storage Models: Shared vs. Local vs. Partitioned Memories
• Hybrid combinations of above
Only a limited subset of these models is in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers
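To make two of the listed models concrete, here is a minimal sketch of fork-join data parallelism and of a work queue, written with ordinary Python threads on a host CPU. The function names (scale, fork_join_map, work_queue_map) are illustrative and not taken from any SDK covered in the course; a real heterogeneous system would run the per-element work on accelerator cores.

```python
from concurrent.futures import ThreadPoolExecutor
import queue
import threading


def scale(x):
    # Stand-in for the per-element work an accelerator would do.
    return 2 * x


def fork_join_map(data, workers=4):
    """Fork-join data parallelism: fork tasks over the data, join on all results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, data))  # map preserves input order


def work_queue_map(data, workers=4):
    """Work queue: idle workers pull the next item until they see a sentinel."""
    tasks = queue.Queue()
    results = [None] * len(data)

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work
                return
            index, value = item
            results[index] = scale(value)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for index, value in enumerate(data):
        tasks.put((index, value))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Both functions compute the same result; they differ in who decides which worker handles which element. Fork-join fixes the decomposition up front, while the work queue balances load dynamically.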
Heterogeneous Processor Spectrum
Heterogeneous Multicore
Dimension 1: Distance of accelerator from main processor
Dimension 2: Hardware customization in accelerator
Focus of this course
Spectrum of Programmers for Heterogeneous Processors
• Application-level Users
— Plug & play experience by using ISV frameworks such as MATLAB and Mathematica, etc
• Library-level Programmers
— Portable library interface that works across homogeneous and heterogeneous processors
• Language-level Programmers
— Portable programming language that works across homogeneous and heterogeneous processors
— Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!
• SDK-level Programmers
— C-based compilers and tools that are specific to a given heterogeneous processor
Focus of this course
Cell Broadband Engine (BE)
Cell Performance
Cell Temperature Distribution
Power and heat are key constraints
Code Partitioning for Cell
[Figure: outlining and cloning applied to a program's flow graph and call graph, with one clone compiled for the PPE and one for the SPE. Key: flow graph node, call graph node, flow graph edge, call graph edge]
• Outlining: extract parallel loop into a separate procedure
• Cloning: make separate copies for PPE and SPE, including clones of all procedures called from loop
• Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes
• Reference: “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006
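The three steps can be sketched on a toy loop. This is only an illustration in Python with host threads, not the Cell compiler's output or the Cell SDK API: the queue objects stand in for mailbox queues and signal registers, and all names (outlined_loop, loop_body_ppe, loop_body_spe, spe_thread, run_partitioned) are hypothetical.

```python
import queue
import threading


def loop_body(x):
    # Procedure called from the parallel loop; it must be cloned too.
    return x * x


# Original code:  for i in range(n): out[i] = loop_body(a[i])

def outlined_loop(a, out, lo, hi, body):
    """Outlining: the parallel loop extracted into its own procedure."""
    for i in range(lo, hi):
        out[i] = body(a[i])


# Cloning: one copy per target. The copies are identical here; a real
# compiler would compile each clone for a different instruction set.
loop_body_ppe = loop_body
loop_body_spe = loop_body


def spe_thread(mailbox, done, a, out):
    """Hypothetical SPE side: wait for an index range, then run the SPE clone."""
    lo, hi = mailbox.get()
    outlined_loop(a, out, lo, hi, loop_body_spe)
    done.put("finished")  # stands in for writing a signal register


def run_partitioned(a):
    out = [None] * len(a)
    mailbox, done = queue.Queue(), queue.Queue()
    spe = threading.Thread(target=spe_thread, args=(mailbox, done, a, out))
    spe.start()
    mid = len(a) // 2
    mailbox.put((mid, len(a)))                    # coordination: PPE to SPE
    outlined_loop(a, out, 0, mid, loop_body_ppe)  # PPE runs its own share
    done.get()                                    # coordination: SPE to PPE
    spe.join()
    return out
```

Calling run_partitioned([1, 2, 3, 4, 5]) squares every element, with the first two handled by the "PPE" path and the rest by the "SPE" path.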
Why GPUs?
• A quiet revolution and potential build-up
— Calculation: 367 GFLOPS vs. 32 GFLOPS
— Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
— Until last year, programmed through graphics API
— GPU in every PC and workstation – massive volume and potential impact
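A quick worked calculation with the quoted figures shows that the compute and bandwidth advantages are of similar size. Reading the first number of each pair as the GPU and the second as the CPU is an assumption based on the slide title; the variable names are illustrative.

```python
# Figures quoted on the slide; treating the first number of each pair as the
# GPU and the second as the CPU is an assumption.
gpu_gflops, cpu_gflops = 367.0, 32.0
gpu_bw_gbs, cpu_bw_gbs = 86.4, 8.4

compute_ratio = gpu_gflops / cpu_gflops      # about 11.5x
bandwidth_ratio = gpu_bw_gbs / cpu_bw_gbs    # about 10.3x

# Peak floating-point operations available per byte of memory traffic.
gpu_flops_per_byte = gpu_gflops / gpu_bw_gbs  # about 4.2
cpu_flops_per_byte = cpu_gflops / cpu_bw_gbs  # about 3.8
```

Under these peak numbers, a kernel that performs fewer than roughly four floating-point operations per byte fetched from memory would be limited by bandwidth rather than by arithmetic on either device.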
Sample GPU Applications
Application | Description | Source | Kernel | % time
----------- | ----------- | ------ | ------ | ------
H.264 | SPEC ‘06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ‘06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
Performance of Sample Kernels and Applications
• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel
Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt
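Kernel speedup and whole-application speedup are different quantities. The slides do not name it, but Amdahl's law is the standard way to relate the two. The sketch below assumes the "% time" column of the sample applications table is the fraction of the original run time spent in the kernel; the function names are illustrative.

```python
def app_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the kernel is accelerated."""
    serial = 1.0 - kernel_fraction
    return 1.0 / (serial + kernel_fraction / kernel_speedup)


def max_app_speedup(kernel_fraction):
    """Upper bound as the kernel speedup grows without limit."""
    return 1.0 / (1.0 - kernel_fraction)


# "% time" values taken from the sample applications table.
h264 = app_speedup(0.35, 10.0)  # kernel is 35% of run time: about 1.46x overall
fem = app_speedup(0.99, 10.0)   # kernel is 99% of run time: about 9.17x overall
```

With a 10× kernel speedup, an application that spends 35% of its time in the kernel gains less than 1.5× overall and can never exceed about 1.54×, whereas one that spends 99% of its time there keeps most of the kernel gain.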
FPGAs: Basics of FPGA Offload
Source: “Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications” by Doug Johnson, gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/14_Doug_Johnson.pdf
FPGA Acceleration Examples
ClearSpeed Multi-Threaded Array Processor (MTAP)
• Hardware multi-threading for latency tolerance
• Asynchronous, overlapped I/O
• Poly execution unit contains 96 Processor Elements (PEs) or cores.
• Array of PEs operates in a synchronous manner, i.e. each PE executes the same instruction on its data.
Source: “Accelerating HPC Applications with ClearSpeed” by Daniel Kliger, www.cse.scitech.ac.uk/disco/mew17/talks/ClearSpeed%20Daresbury%20MEW%202006.pdf
Clearspeed Linpack results
• Standard System
— Two 3.0 GHz Intel Xeon 5160 (Woodcrest) dual core processors, 16GB memory per node
– Single server: 34 GFLOPS
– Four node cluster: 136 GFLOPS
– Power consumption: 1,940 Watts
– Benchmark runtime: 48.4 minutes
• ClearSpeed Accelerated System
— Add two Advance accelerator boards per node (25W per board!)
– Single server: 90.1 GFLOPS
– Four node cluster: 364.2 GFLOPS
– Power consumption: 2,140 Watts
– Benchmark runtime: 18.4 minutes
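The quoted results can be turned into throughput, efficiency, and energy ratios with a few lines of arithmetic. Treating the power and runtime figures as values for the four-node cluster is an assumption; it is consistent with the 200 W difference, which matches eight boards at 25 W each. The dictionary keys and function names are illustrative.

```python
# Four-node cluster figures from the slide. Treating the quoted power and
# runtime as cluster-wide values is an assumption.
base = {"gflops": 136.0, "watts": 1940.0, "minutes": 48.4}
accel = {"gflops": 364.2, "watts": 2140.0, "minutes": 18.4}


def gflops_per_watt(system):
    return system["gflops"] / system["watts"]


def energy_kwh(system):
    # Energy for one benchmark run: power (kW) times runtime (hours).
    return (system["watts"] / 1000.0) * (system["minutes"] / 60.0)


throughput_gain = accel["gflops"] / base["gflops"]                # about 2.68x
efficiency_gain = gflops_per_watt(accel) / gflops_per_watt(base)  # about 2.43x
energy_ratio = energy_kwh(accel) / energy_kwh(base)               # about 0.42
```

Under that assumption, roughly 10% more power buys about 2.7× the Linpack throughput, and one benchmark run uses less than half the energy.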
ClearSpeed’s CSXL acceleration library
The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.
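The general idea of call interception can be sketched as a wrapper that keeps the host function's interface and decides per call where to run it. This is not the CSXL API or its actual dispatch policy: the size threshold, the function names (host_dgemm, accelerator_dgemm, intercept), and the pure-Python matrix multiply are all assumptions made for illustration.

```python
def host_dgemm(a, b):
    """Plain host matrix multiply standing in for a BLAS DGEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def accelerator_dgemm(a, b):
    """Hypothetical accelerator path; here it simply reuses the host code."""
    return host_dgemm(a, b)


def intercept(host_fn, accel_fn, min_size):
    """Return a drop-in replacement that offloads only large problems."""
    calls = {"host": 0, "accelerator": 0}

    def wrapper(a, b):
        if len(a) >= min_size:  # assumed policy: large enough to repay transfers
            calls["accelerator"] += 1
            return accel_fn(a, b)
        calls["host"] += 1
        return host_fn(a, b)

    wrapper.calls = calls  # exposed so callers can see where each call ran
    return wrapper


# Application code keeps calling dgemm(a, b) and never sees the dispatch.
dgemm = intercept(host_dgemm, accelerator_dgemm, min_size=3)
```

The point of the design is that application code is unchanged: only the binding of the name dgemm differs between the standard and the accelerated system.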
Imagine Stream Processor
Transforming Memory Accesses to Communication for Scalability
Software challenge: deliver productivity of shared memory model, combined with scalability of communication model
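As a small illustration of the two styles, the sketch below computes the same reduction twice: once with every access going to "main memory", and once with explicit bulk copies into a fixed-size local buffer. The buffer size and function names are hypothetical, and a Python list slice stands in for a DMA-style transfer.

```python
LOCAL_STORE_WORDS = 4  # hypothetical local memory capacity, in elements


def sum_squares_shared(main_memory):
    """Shared-memory style: every access goes straight to main memory."""
    return sum(x * x for x in main_memory)


def sum_squares_communicating(main_memory, local_words=LOCAL_STORE_WORDS):
    """Communication style: bulk-copy one tile at a time into a local buffer."""
    total = 0
    transfers = 0
    for start in range(0, len(main_memory), local_words):
        local = list(main_memory[start:start + local_words])  # explicit "get"
        transfers += 1
        total += sum(x * x for x in local)  # compute touches local data only
    return total, transfers
```

The first version is easier to write; the second makes data movement explicit and countable, which is what lets it scale when main-memory bandwidth is the bottleneck.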
Example of how Compilers can Help
Source: UIUC ECE 497, courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt
Opportunity for new languages to reduce compiler effort and broaden applicability
Code Partitioning for Heterogeneous Processors
• Factors to consider when extracting a region of code for execution on an accelerator
— Matching operations in code region with primitives in accelerator (includes instruction selection and FPGA synthesis)
— Establishing coherence between main and local memories
— Obeying local memory size constraints
— Volume of data to be communicated
— Granularity of region relative to overhead of thread creation
— Structural constraints of task/thread being extracted
— Cloning of code that needs to be executed on multiple elements
— Coordination with rest of the program (coroutine vs. macro-dataflow models)
— . . .
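Three of these factors (local memory size, data volume, and granularity relative to overhead) can be combined into a simple profitability test. The cost model below is a deliberately crude sketch, not a method from the slides: it assumes transfers and computation do not overlap, and it rejects any region whose working set exceeds local memory instead of tiling it. All parameter names are illustrative.

```python
def offload_time(accel_seconds, bytes_moved, bandwidth_bytes_per_s, launch_overhead_s):
    """Estimated time to run a region on the accelerator, including data movement."""
    return accel_seconds + bytes_moved / bandwidth_bytes_per_s + launch_overhead_s


def should_offload(host_seconds, accel_seconds, bytes_moved,
                   bandwidth_bytes_per_s, launch_overhead_s,
                   working_set_bytes, local_memory_bytes):
    """Offload only if the region fits in local memory and the estimate wins."""
    if working_set_bytes > local_memory_bytes:
        return False  # violates the local memory size constraint
    estimate = offload_time(accel_seconds, bytes_moved,
                            bandwidth_bytes_per_s, launch_overhead_s)
    return estimate < host_seconds
```

A coarse region that runs for a second on the host easily repays a millisecond of launch overhead, while a region that runs for ten microseconds does not, however fast the accelerator is.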
Reading List for Next Lecture (Sep 10th)
1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006, http://researchweb.watson.ibm.com/journal/sj/451/eichenberger.pdf
2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al, PPoPP 2007 Best Paper, March 2007, http://portal.acm.org/ft_gateway.cfm?id=1229445&type=pdf&coll=portal&dl=ACM&CFID=14018324&CFTOKEN=91433508
Announcement: Kickoff Meeting for Habanero Multicore Software Research Project
Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and is synergistic with the expertise we have in various CS groups at Rice including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups. A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH 3076. Cookies will be served!
BACKUP SLIDES START HERE
Freescale MPC8572 PowerQUICC III Processor
• Dual Embedded e500 core 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
— L1 cache—32 KB data and 32 KB
— Shared L2 cache—1 MB with ECC
— L2 configurable as SRAM, cache and I/O transactions can be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers
Freescale MPC8572 PowerQUICC III Processor
Source: Freescale
AMD’s use of HyperTransport (Torrenza)
• “Torrenza” technology
— Allows licensing of coherent HyperTransport™ to 3rd party manufacturers to make socket-compatible accelerators/co-processors
— Allows 3rd party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
— Could make accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory.