Vivek Sarkar

Department of Computer Science, Rice University

[email protected]

August 27, 2007

COMP 635: Seminar on Heterogeneous Processors

www.cs.rice.edu/~vsarkar/comp635

COMP 635, Fall 2007 (V. Sarkar)

Course Goals

• Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum

• Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)

• Discuss research challenges in advancing the state of the art of software for heterogeneous processors

• Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or understanding strengths and weaknesses of heterogeneous processors w.r.t. their research areas


Course Organization

• Class dates (12 lectures)
— 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
— No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007 conference week)
— No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week

• Time & Place
— Default: Mondays, 3:30pm - 4:30pm, DH 2014
— Exception: time & place for 9/20 (Thurs) lecture TBD
— 30 minutes reserved after lecture for discussion (optional)

• Office Hours (DH 3131)
— 11am - 12noon, Fridays from 8/31/07 to 12/7/07

• OWL-Space repository: COMP 635 F07

• Grading
— Satisfactory/unsatisfactory grade for students taking seminar for credit
– Others should register officially as auditors, if possible
— For a satisfactory grade, you need to
1. Attend at least 50% of lectures
2. Submit a 4-page project/study report by 12/7/07 (report can be prepared in a group - just plan on 4 pages/person in that case)
— Optional in-class presentation of project/study report on 12/3/07


Course Content

• Introduction to Heterogeneous Processors and their Programming Models (1 lecture)

• Cell Processor and Cell SDK (2 lectures)

• Nvidia GPU and CUDA programming environment (2 lectures)

• DRC FPGA Coprocessor Module and Celoxica Programming Environment (1 lecture)

• ClearSpeed Accelerator and SDK (1 lecture)

• Imagine Stream Processor (1 lecture)

• Microsoft Accelerator Library (1 lecture)

• Vector and SIMD processors -- a historical perspective (1 lecture)

• Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)

• Student presentations (1 lecture)


COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models


Acknowledgments

• Georgia Tech ECE 6100, Module 14
— Vince Mooney, Krishna Palem, Sudhakar Yalamanchili
— http://www.ece.gatech.edu/academic/courses/fall2006/ece6100/Class/index.html

• MIT 6.189 IAP 2007, Lecture 2
— “Introduction to the Cell Processor”, Michael Perrone
— http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

• UIUC ECE 497, Lecture 16
— courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

• UIUC ECE 498 AL1, Programming Massively Parallel Processors
— David Kirk, Wen-mei Hwu
— http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html


Heterogeneous Processors

[Figure: block diagram with a general-purpose processor (GPP), a memory transfer module (MTM), main memory, and accelerators (ACC) with local memories]

• Memory transfer module schedules system-wide bulk data movement

• General-purpose processor orchestrates activity

• Accelerators can use scheduled, streaming communication, or can operate on locally-buffered data pushed to them in advance

• Accelerated activities and associated private data are localized for bandwidth, power, efficiency

Motivation:

1) Different parts of programs have different requirements
— Control-intensive portions need good branch predictors, speculation, big caches to achieve good performance
— Data-processing portions need lots of ALUs, have simpler control flows

2) Power consumption
— Features like branch prediction and out-of-order execution tend to have very high power/performance ratios
— Applications often have time-varying performance requirements


Sample Application Domains for Heterogeneous Processors

• Cell Processor
— Medical imaging, Drug discovery, Reservoir modeling, Seismic analysis, …

• GPU (e.g., Nvidia)
— Computer-aided design (CAD), Digital content creation (DCC), emerging HPC applications, …

• FPGA (e.g., Xilinx DRC)
— HPC, Petroleum, Financial, …

• HPC accelerators (e.g., ClearSpeed)
— HPC, Network processing, Graphics, …

• Stream Processors (e.g., Imagine)
— Image processing, Signal processing, Video, Graphics, …

• Others
— TCP/IP offload, Crypto, …


Programming Models for Heterogeneous Processors

• Data Parallelism

• Single Program Multiple Data (SPMD)

• Pipelining

• Work Queue

• Fork Join

• Message Passing

• Storage Models: Shared vs. Local vs. Partitioned Memories

• Hybrid combinations of above

Only a limited subset of these models is in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers


Heterogeneous Processor Spectrum

[Figure: heterogeneous processor spectrum along two dimensions, with "Heterogeneous Multicore" and "Focus of this course" regions marked]

Dimension 1: Distance of accelerator from main processor

Dimension 2: Hardware customization in accelerator



Spectrum of Programmers for Heterogeneous Processors

• Application-level Users
— Plug & play experience by using ISV frameworks such as MATLAB and Mathematica, etc.

• Library-level Programmers
— Portable library interface that works across homogeneous and heterogeneous processors

• Language-level Programmers
— Portable programming language that works across homogeneous and heterogeneous processors
— Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!

• SDK-level Programmers
— C-based compilers and tools that are specific to a given heterogeneous processor




Cell Broadband Engine (BE)


Cell Performance


Cell Temperature Distribution

Power and heat are key constraints


Code Partitioning for Cell

[Figure: code partitioning diagram showing the Outlining and Cloning steps, with separate "Compile for PPE" and "Compile for SPE" paths; the key distinguishes flow graph nodes and edges from call graph nodes and edges]

• Outlining: extract parallel loop into a separate procedure

• Cloning: make separate copies for PPE and SPE, including clones of all procedures called from loop

• Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes

• Reference: “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006
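The outlining and coordination steps can be pictured with a small sketch. This is only a Python analogy, not the transformation the IBM compiler performs: the function names (`scale_original`, `outlined_loop_body`, `scale_partitioned`) are hypothetical, and a thread pool stands in for the SPEs and their signal/mailbox coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def scale_original(data, factor):
    """Original code: a parallel loop embedded in its caller."""
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = data[i] * factor
    return out

def outlined_loop_body(chunk, factor):
    """Outlining: the loop body extracted into its own procedure.
    On Cell, a clone of this procedure would be compiled for the SPE."""
    return [x * factor for x in chunk]

def scale_partitioned(data, factor, num_workers=4):
    """Host (PPE-like) side: split the iteration space, dispatch chunks to
    workers, and collect results. The pool is an assumed stand-in for the
    coordination code a real compiler would insert."""
    if not data:
        return []
    chunk_size = (len(data) + num_workers - 1) // num_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves chunk order, so results can simply be concatenated
        parts = list(pool.map(outlined_loop_body, chunks, [factor] * len(chunks)))
    return [x for part in parts for x in part]
```

Both versions compute the same result; only the partitioned one exposes a procedure that could be cloned and compiled separately for each kind of core.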


Why GPUs?

• A quiet revolution and potential build-up
— Calculation: 367 GFLOPS vs. 32 GFLOPS
— Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
— Until last year, programmed through graphics API
— GPU in every PC and workstation – massive volume and potential impact


Sample GPU Applications

Application | Description | Source lines | Kernel lines | % time
------------|-------------|--------------|--------------|-------
H.264 | SPEC ‘06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ‘06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%


Performance of Sample Kernels and Applications

• GeForce 8800 GTX vs. 2.2GHz Opteron 248

• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads

• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized

• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel

Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt
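Kernel speedup and whole-application speedup can differ widely, depending on the fraction of run time spent in the kernel (the "% time" figures in the Sample GPU Applications slide). The sketch below applies Amdahl's law, which the slides do not name; it is used here as a standard model. It assumes the non-kernel part of the run time is unchanged by offloading, and the function name `application_speedup` is hypothetical.

```python
def application_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only the kernel is accelerated.

    kernel_fraction: share of original run time spent in the kernel (0..1)
    kernel_speedup:  factor by which the kernel itself runs faster (> 0)
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial = 1.0 - kernel_fraction
    return 1.0 / (serial + kernel_fraction / kernel_speedup)

# H.264 spends 35% of its time in the kernel, so even an unbounded kernel
# speedup caps the application at 1 / (1 - 0.35), roughly 1.54x.
h264 = application_speedup(0.35, 10.0)   # 10x kernel speedup, the "typical" case
tracf = application_speedup(0.96, 10.0)  # TRACF: 96% of time in the kernel
```

Under these assumptions a 10× kernel speedup gives H.264 about 1.46× overall, while TRACF reaches about 7.35×.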


FPGAs: Basics of FPGA Offload

Source: “Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications” by Doug Johnson, [email protected], gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/14_Doug_Johnson.pdf


FPGA Acceleration Examples


ClearSpeed Multi-Threaded Array Processor (MTAP)

• Hardware multi-threading for latency tolerance

• Asynchronous, overlapped I/O

• Poly execution unit contains 96 Processor Elements (PEs) or cores

• Array of PEs operates in a synchronous manner, i.e. each PE executes the same instruction on its data

Source: “Accelerating HPC Applications with ClearSpeed” by Daniel Kliger, [email protected], www.cse.scitech.ac.uk/disco/mew17/talks/ClearSpeed%20Daresbury%20MEW%202006.pdf


ClearSpeed Linpack results

• Standard System
— Two 3.0 GHz Intel Xeon 5160 (Woodcrest) dual core processors, 16GB memory per node
– Single server: 34 GFLOPS
– Four node cluster: 136 GFLOPS
– Power consumption: 1,940 Watts
– Benchmark runtime: 48.4 minutes

• ClearSpeed Accelerated System
— Add two Advance accelerator boards per node (25W per board!)
– Single server: 90.1 GFLOPS
– Four node cluster: 364.2 GFLOPS
– Power consumption: 2,140 Watts
– Benchmark runtime: 18.4 minutes
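The Linpack figures can be turned into throughput and power-efficiency ratios with a little arithmetic. The sketch below assumes the power numbers describe the whole four-node cluster, which is consistent with the board count: 4 nodes × 2 boards × 25 W = 200 W, the difference between the two power readings. The variable and function names are illustrative.

```python
# Figures copied from the slide (four-node cluster).
baseline = {"gflops": 136.0, "watts": 1940.0, "minutes": 48.4}
accelerated = {"gflops": 364.2, "watts": 2140.0, "minutes": 18.4}

# Assumption: power figures cover the whole four-node cluster.
# Check: 4 nodes x 2 boards x 25 W = 200 W, which matches 2,140 - 1,940.
added_board_watts = 4 * 2 * 25

def gflops_per_watt(system):
    """Sustained Linpack throughput per watt of system power."""
    return system["gflops"] / system["watts"]

throughput_gain = accelerated["gflops"] / baseline["gflops"]
runtime_gain = baseline["minutes"] / accelerated["minutes"]
efficiency_gain = gflops_per_watt(accelerated) / gflops_per_watt(baseline)
power_increase = accelerated["watts"] / baseline["watts"] - 1.0
```

On these numbers the accelerated cluster delivers roughly 2.7× the throughput for about 10% more power, or about 2.4× the GFLOPS per watt.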


ClearSpeed’s CSXL acceleration library

The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.
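The idea of intercepting library calls can be sketched in a few lines. How CSXL actually intercepts calls and decides what to offload is not described in the slide, so everything below is an assumption made for illustration: the class `InterceptingBlas`, the `min_offload_dim` threshold, and the stand-in `accelerator_dgemm` are hypothetical and are not ClearSpeed APIs.

```python
def host_dgemm(a, b):
    """Plain host matrix multiply on lists of lists (stand-in for BLAS DGEMM)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def accelerator_dgemm(a, b):
    """Hypothetical accelerator path. It reuses the host code so the sketch
    stays runnable; a real library would ship the data to the board."""
    return host_dgemm(a, b)

class InterceptingBlas:
    """Exposes the same dgemm interface as the host library, but decides
    per call whether to route the work to the accelerator."""

    def __init__(self, min_offload_dim=64):
        # Hypothetical threshold: small problems stay on the host because
        # transfer overhead would outweigh the accelerator's advantage.
        self.min_offload_dim = min_offload_dim
        self.offloaded_calls = 0

    def dgemm(self, a, b):
        if min(len(a), len(b), len(b[0])) >= self.min_offload_dim:
            self.offloaded_calls += 1
            return accelerator_dgemm(a, b)
        return host_dgemm(a, b)
```

Because the caller sees an unchanged interface, application code does not need to be modified to benefit from the accelerator, which is the appeal of the library-level approach.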


Imagine Stream Processor


Transforming Memory Accesses to Communication for Scalability

Software challenge: deliver productivity of shared memory model, combined with scalability of communication model


Example of how Compilers can Help

Source: UIUC ECE 497, courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

Opportunity for new languages to reduce compiler effort and broaden applicability


Code Partitioning for Heterogeneous Processors

• Factors to consider when extracting a region of code for execution on an accelerator
— Matching operations in code region with primitives in accelerator (includes instruction selection and FPGA synthesis)
— Establishing coherence between main and local memories
— Obeying local memory size constraints
— Volume of data to be communicated
— Granularity of region relative to overhead of thread creation
— Structural constraints of task/thread being extracted
— Cloning of code that needs to be executed on multiple elements
— Coordination with rest of the program (coroutine vs. macro-dataflow models)
— . . .
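Several of these factors (local memory size, volume of data communicated, granularity relative to launch overhead) can be combined into a simple profitability test. The sketch below is one possible cost model under stated assumptions, not a method given in the lecture: the `Region` and `Accelerator` records, their fields, and `should_offload` are hypothetical, and the run-time estimates are assumed to come from profiling or static analysis.

```python
from dataclasses import dataclass

@dataclass
class Region:
    host_seconds: float       # estimated run time on the main processor
    accel_seconds: float      # estimated run time on the accelerator
    bytes_moved: int          # data copied to and from local memory
    working_set_bytes: int    # data that must be resident in local memory

@dataclass
class Accelerator:
    local_memory_bytes: int
    bandwidth_bytes_per_s: float
    launch_overhead_s: float  # thread creation and coordination cost

def should_offload(region, accel):
    """Return True if offloading is estimated to beat running on the host."""
    if region.working_set_bytes > accel.local_memory_bytes:
        return False  # violates the local memory size constraint
    transfer = region.bytes_moved / accel.bandwidth_bytes_per_s
    offload_total = accel.launch_overhead_s + transfer + region.accel_seconds
    return offload_total < region.host_seconds
```

A real partitioner could tile a region whose working set exceeds local memory instead of rejecting it outright; the sketch rejects it only to keep the model small.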


Reading List for Next Lecture (Sep 10th)

1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006, http://researchweb.watson.ibm.com/journal/sj/451/eichenberger.pdf

2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al, PPoPP 2007 Best Paper, March 2007, http://portal.acm.org/ft_gateway.cfm?id=1229445&type=pdf&coll=portal&dl=ACM&CFID=14018324&CFTOKEN=91433508


Announcement: Kickoff Meeting for Habanero Multicore Software Research Project

Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and is synergistic with the expertise we have in various CS groups at Rice including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups. A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH 3076. Cookies will be served!


BACKUP SLIDES START HERE


Freescale MPC8572 PowerQUICC III Processor

• Dual Embedded e500 core, 36-bit physical addressing

• Double-precision floating-point

• Integrated L1/L2 cache
— L1 cache: 32 KB data and 32 KB
— Shared L2 cache: 1 MB with ECC
— L2 configurable as SRAM or cache; I/O transactions can be stashed into L2 cache regions

• Integrated DDR memory controller with full ECC support

• Integrated security engine, Pattern Matching Engine, Packet Deflate Engine

• Four on-chip triple-speed Ethernet controllers


Freescale MPC8572 PowerQUICC III Processor

Source: Freescale


AMD’s use of HyperTransport (Torrenza)

• “Torrenza” technology
— Allows licensing of coherent HyperTransport™ to 3rd party manufacturers to make socket-compatible accelerators/co-processors
— Allows 3rd party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
— Could make accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory
