COMP 422 Parallel Computing: An Introduction
John Mellor-Crummey
Department of Computer Science
Rice University
[email protected]
COMP 422, Lecture 1, 11 January 2011
Course Information
• Meeting time: TTh 1:00-2:15
• Meeting place: DH 1075
• Instructor: John Mellor-Crummey
—Email: [email protected]
—Office: DH 3082, x5179
—Office Hours: Wednesday 8:30am-9:30am or by appointment
• TA: Xu Liu
—Email: [email protected]
—Office: DH 3063, x5710
—Office Hours: TBA
• WWW site: http://www.clear.rice.edu/comp422
Parallelism
• Definition: ability to execute parts of a program concurrently
• Goal: shorter running time
• Grain of parallelism: how big are the parts?
—bit, instruction, statement, loop iteration, procedure, …
• COMP 422 focus: explicit thread-level parallelism
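A minimal sketch of explicit thread-level parallelism, assuming C with OpenMP (one of the programming models covered later in the course; compile with a flag such as gcc -fopenmp; array and variable names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Loop-level parallelism: each thread executes a disjoint
           subset of the iterations concurrently. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f (threads available: %d)\n",
               c[N-1], omp_get_max_threads());
        return 0;
    }

Each loop iteration is independent, so the runtime can divide the iterations among threads without changing the result; the goal is a shorter running time.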
Course Objectives
• Learn fundamentals of parallel computing
—principles of parallel algorithm design
—programming models and methods
—parallel computer architectures
—parallel algorithms
—modeling and analysis of parallel programs and systems
• Develop skill writing parallel programs
—programming assignments
• Develop skill analyzing parallel computing problems
—solving problems posed in class
Text
Introduction to Parallel Computing, 2nd Edition
Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
Addison-Wesley, 2003
Required
Topics (Part 1)
• Introduction (Chapter 1)
• Principles of parallel algorithm design (Chapter 3)
—decomposition techniques
—mapping & scheduling computation
—templates
• Programming shared-address space systems (Chapter 7)
—Cilk/Cilk++
—OpenMP
—Pthreads
—synchronization
• Parallel computer architectures (Chapter 2)
—shared memory systems and cache coherence
—distributed-memory systems
—interconnection networks and routing
Topics (Part 2)
• Programming scalable systems (Chapter 6)
—message passing: MPI
—global address space languages: UPC, CAF
• Collective communication
• Analytical modeling of program performance (Chapter 5)
—speedup, efficiency, scalability, cost optimality, isoefficiency
• Parallel algorithms (Chapters 8 & 10)
—non-numerical algorithms: sorting, graphs, dynamic prog.
—numerical algorithms: dense and sparse matrix algorithms
• Performance measurement and analysis of parallel programs
• GPU Programming with CUDA and OpenCL
• Problem solving on clusters using MapReduce
Prerequisites
• Programming in C and/or Fortran
• Basics of data structures
• Basics of machine architecture
• Required courses: Comp 221 or equivalent
• See me if you have concerns!
Motivating Parallel Computing
• Technology push
• Application pull
The Rise of Multicore Processors
Advance of Semiconductors: “Moore’s Law”
Gordon Moore, co-founder of Intel
• 1965: since the integrated circuit was invented, the number of
transistors/inch² in these circuits roughly doubled every year;
this trend would continue for the foreseeable future
• 1975: revised - circuit complexity doubles every 18 months
Image credit: http://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Leveraging Moore’s Law Trends
From increasing circuit density to performance
• More transistors = ↑ opportunities for exploiting parallelism
• Microprocessors
—implicit parallelism: not visible to the programmer
– pipelined execution of instructions
– multiple functional units for multiple independent pipelines
—explicit parallelism
– streaming and multimedia processor extensions
operations on 1, 2, and 4 data items per instruction
integer, floating point, complex data
MMX, Altivec, SSE (a minimal SSE sketch follows this list)
– long instruction words
bundles of independent instructions that can be issued together
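For the SIMD extensions above, a minimal sketch using SSE intrinsics from the standard <xmmintrin.h> header (assumes an x86 compiler with SSE enabled; function and array names are illustrative):

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers, 4 floats per op */

    /* Add two float arrays four elements at a time using SSE.
       Assumes n is a multiple of 4; a remainder loop would handle the rest. */
    void vadd_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 additions */
            _mm_storeu_ps(&c[i], vc);
        }
    }

Each _mm_add_ps performs four single-precision additions in a single instruction, which is the "operations on multiple data items per instruction" form of explicit parallelism noted above.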
Microprocessor Architecture (Mid 90’s)
• Superscalar (SS) designs were the state of the art
—multiple instruction issue
—dynamic scheduling: h/w tracks dependencies between instr.
—speculative execution: look past predicted branches
—non-blocking caches: multiple outstanding memory ops
• Apparent path to higher performance?
—wider instruction issue
—support for more speculation
The End of the Free Lunch
Increasing issue width provides diminishing returns
Two factors¹
• Fundamental circuit limitations
—delays increase as issue queues and multi-port register files grow
—increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism
—inefficient for codes with difficult-to-predict branches
¹The Case for a Single-Chip Multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
Issue Waste
Instruction-level Parallelism Concerns
• Contributing factors
—instruction dependencies
—long-latency operations within a thread
Sources of Wasted Issue Slots
• TLB miss
• I cache miss
• D cache miss
• Load delays (L1 hits)
• Control hazard
• Branch misprediction
• Instruction dependences
• Memory conflict
Figure categories: Memory Hierarchy, Control Flow, Instruction Stream
Control hazard: not knowing which instruction to execute after a branch (will the branch be taken or not)
How much IPC? Simulate 8-issue SS
Simultaneous multithreading: maximizing on-chip parallelism, Tullsen et al., ISCA 1995.
Applications: most of SPEC92
• On average < 1.5 IPC (19% of issue slots used)
• Short FP dependences: 37%
• 6 other causes of waste, each ≥ 4.5%
• Dominant waste differs by application
Power and Heat Stall Clock Frequencies
May 17, 2004 … Intel, the world's largest chip maker, publicly acknowledged that it had hit a "thermal wall" on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects …
Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor … Intel's decision to change course and embrace a "dual core" processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers … some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures …
New York Times, May 17, 2004
Recent Multicore Processors
• March 2010: Intel Westmere
— 6 cores; 2-way SMT
• March 2010: AMD Magny-Cours
— two 6-core dies in a single package
• January 2010: IBM Power7
— 8 cores; 4-way SMT; 32MB shared cache
• June 2009: AMD Istanbul
— 6 cores
• Sun T2
— 8 cores; 8-way fine-grain MT per core
• IBM Cell
— 1 PPC core; 8 SPEs w/ SIMD parallelism
• Tilera:
— 64 cores in an 8x8 mesh; 5MB cache on chip
August 2009: Nehalem EX
Figure credit: http://hothardware.com/News/Intel-Unveils-NehalemEX-OctalCore-Server-CPU
Jan 2010: IBM Power7
• Slide source: http://common.ziffdavisinternet.com/util_get_image/21/0,1425,sz=1&i=211884,00.jpg
• Slide attributed to IBM
New Applications for the Desktop
Intel’s Recognition, Mining, Synthesis Research Initiative
Compute-Intensive, Highly Parallel Applications and Uses
Intel Technology Journal, 9(2), May 19, 2005. DOI: 10.1535/itj.0902
http://www.intel.com/technology/itj/2005/volume09issue02/
graphics, computer vision, data mining, large-scale optimization
Application Pull
• Complex problems require computation on large-scale data
• Sufficient performance available only through massive parallelism
Computing and Science
• “Computational modeling and simulation are among the most significant developments in the practice of scientific inquiry in the 20th century. Within the last two decades, scientific computing has become an important contributor to all scientific disciplines.
• It is particularly important for the solution of research problems that are insoluble by traditional scientific theoretical and experimental approaches, hazardous to study in the laboratory, or time consuming or expensive to solve by traditional means”
— “Scientific Discovery through Advanced Computing,” DOE Office of Science, 2000
The Need for Speed: Complex Problems
• Science
—understanding matter from elementary particles to cosmology
—storm forecasting and climate prediction
—understanding biochemical processes of living organisms
• Engineering
—combustion and engine design
—computational fluid dynamics and airplane design
—earthquake and structural modeling
—pollution modeling and remediation planning
—molecular nanotechnology
• Business
—computational finance
—information retrieval
—data mining
• Defense
—nuclear weapons stewardship
—cryptology
Earthquake Simulation
Earthquake Research Institute, University of Tokyo
Tonankai-Tokai Earthquake Scenario
Photo Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
Ocean Circulation Simulation
Ocean Global Circulation Model for the Earth Simulator
Seasonal Variation of Ocean Temperature
Photo Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
Community Earth System Model (CESM)
Figure courtesy of M. Vertenstein (NCAR)
CESM Execution Configurations
Figure courtesy of M. Vertenstein (NCAR)
CESM Simulations on a Cray XT
Figure courtesy of Pat Worley (ORNL)
Simulating Turbulent Reacting Flows: S3D
• Direct numerical simulation (DNS) of turbulent combustion
—state-of-the-art code developed at CRF/Sandia
– PI: Jacqueline H. Chen, SNL
—2010: Cray XT (65M) and Blue Gene/P (2M) CPU hours
—“High-fidelity simulations for clean and efficient combustion of alternative fuels”
• Science
—study micro-physics of turbulent reacting flows
– physical insight into chemistry-turbulence interactions
—simulate detailed chemistry; multi-physics (sprays, radiation, soot)
—develop and validate reduced model descriptions used in macro-scale simulations of engineering-level systems
[Figure: DNS → physical models → engineering CFD codes (RANS, LES)]
Text and figures courtesy of Jacqueline H. Chen, SNL
Fluid-Structure Interactions
• Simulate …
—rotational geometries (e.g. engines, pumps), flapping wings
• Traditionally, such simulations have used a fixed mesh
—drawback: solution quality is only as good as initial mesh
• Dynamic mesh computational fluid dynamics
—integrate automatic mesh generation within parallel flow solver
– nodes added in response to user-specified refinement criteria
– nodes deleted when no longer needed
– element connectivity changes to maintain minimum energy mesh
—mesh changes continuously as geometry + solution changes
• Example: 3D simulation of a hummingbird’s flight
[Andrew Johnson, AHPCRC 2005]
Air Velocity (Front)
Air Velocity (Side)
Mesh Adaptation (front)
Mesh Adaptation (side)
Challenges of Explicit Parallelism
• Algorithm development is harder
—complexity of specifying and coordinating concurrent activities
• Software development is much harder
—lack of standardized & effective development tools and programming models
—subtle program errors: race conditions (a minimal example follows this list)
• Rapid pace of change in computer system architecture
—today’s hot parallel algorithm may not be a good match for tomorrow’s parallel hardware!
– example: homogeneous multicore processors vs. GPGPU
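A minimal sketch of the race-condition hazard mentioned above, assuming C with OpenMP (variable names and the loop bound are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long long sum = 0;

        /* RACE: all threads read-modify-write `sum` without synchronization,
           so updates can be lost and the result is nondeterministic. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            sum += i;                          /* data race */
        printf("racy sum      = %lld\n", sum);

        sum = 0;
        /* One fix: let OpenMP combine per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000000; i++)
            sum += i;
        printf("reduction sum = %lld\n", sum); /* 499999500000 */
        return 0;
    }

The first loop is nondeterministic because concurrent updates to sum can be lost; the reduction clause is one standard repair, and errors of this kind are exactly the subtle bugs that make parallel software development harder.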
Hummingbird Simulation in UPC
• UPC: PGAS language for scalable parallel systems
• Application overview
—distribute mesh among the processors
—partition the mesh using recursive bisection
—each PE maintains and controls its piece of the mesh
– has a list of nodes, faces, and elements
—communication and synchronization
– read from or write to other PEs’ “entities” as required
– processors frequently synchronize using barriers
– use “broadcast” and “reduction” patterns
—constraint
– only 1 processor may change the mesh at a time
Algorithm Sketch
At each time step…
• Test if re-partitioning is required
• Set up interprocessor communication if the mesh changed
• Block (color) elements into vectorizable groups
• Calculate the refinement value at each mesh node
• Move the mesh
• Solve the coupled fluid-flow equation system
• Update the mesh to ensure mesh quality
—swap element faces to obtain a “Delaunay” mesh
—add nodes to locations where there are not enough
—delete nodes from locations where there are too many
—swap element faces to obtain a “Delaunay” mesh
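A schematic C rendering of this per-time-step structure; every type and function name below is a hypothetical placeholder for a phase listed above, not the application’s actual API:

    typedef struct Mesh Mesh;              /* opaque placeholder types */
    typedef struct Solution Solution;

    /* Hypothetical phase routines, one per step on this slide. */
    int  repartition_required(const Mesh *m);
    void setup_communication(Mesh *m);
    void color_elements(Mesh *m);
    void compute_refinement_values(Mesh *m, const Solution *s);
    void move_mesh(Mesh *m, Solution *s);
    void solve_flow_system(Mesh *m, Solution *s);
    void swap_faces_to_delaunay(Mesh *m);
    void add_nodes_where_refined(Mesh *m, const Solution *s);
    void delete_unneeded_nodes(Mesh *m, const Solution *s);

    void time_step(Mesh *mesh, Solution *soln) {
        if (repartition_required(mesh))         /* test if re-partitioning is required   */
            setup_communication(mesh);          /* rebuild inter-processor communication */

        color_elements(mesh);                   /* block elements into vectorizable groups */
        compute_refinement_values(mesh, soln);  /* refinement value at each mesh node      */
        move_mesh(mesh, soln);                  /* move the mesh                           */
        solve_flow_system(mesh, soln);          /* coupled fluid-flow equation system      */

        /* maintain mesh quality (only one processor changes the mesh at a time) */
        swap_faces_to_delaunay(mesh);
        add_nodes_where_refined(mesh, soln);
        delete_unneeded_nodes(mesh, soln);
        swap_faces_to_delaunay(mesh);
    }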
Parallel Hardware in the Large
Hierarchical Parallelism in Supercomputers
• Cores with pipelining and short vectors
• Multicore processors
• Shared-memory multiprocessor nodes
• Scalable parallel systems
Image credit: http://www.nersc.gov/news/reports/bluegene.gif
Achieving High Performance on Parallel Systems
• Memory latency and bandwidth
— CPU rates have improved 4x as fast as memory over the last decade
— bridge the speed gap using the memory hierarchy (see the loop-order sketch below)
— multicore exacerbates demand
• Interprocessor communication
Computation is only part of the picture
• Input/output
— I/O bandwidth to disk typically grows linearly with # processors
Image Credit: Bob Colwell, ISCA 1995
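A small sketch of why memory-hierarchy behavior dominates performance: the same summation with unit-stride versus strided access, in plain C (the array size and timing method are illustrative):

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    static double a[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;

        clock_t t0 = clock();
        double s1 = 0.0;
        for (int i = 0; i < N; i++)        /* row-major traversal: unit stride, */
            for (int j = 0; j < N; j++)    /* cache lines fully reused          */
                s1 += a[i][j];
        double t_row = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        double s2 = 0.0;
        for (int j = 0; j < N; j++)        /* column-major traversal: stride of */
            for (int i = 0; i < N; i++)    /* N doubles, poor cache reuse       */
                s2 += a[i][j];
        double t_col = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("row-major %.3fs, column-major %.3fs (sums %.0f %.0f)\n",
               t_row, t_col, s1, s2);
        return 0;
    }

Both loops do identical arithmetic; the strided version is typically several times slower because it wastes the bandwidth and latency-hiding that the memory hierarchy provides.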
Historical Concurrency in Top 500 Systems
Image credit: http://www.top500.org/overtime/list/36/procclass
Scale of Largest HPC Systems (Nov. 2010)
Image credit: http://www.top500.org/list/2010/11/100
Figure annotations: >100K cores; >10¹⁵ FLOP/s; almost 300K cores; hybrid CPU+GPU
Forthcoming NSF “Blue Waters” System
• NSF’s goal for 2006-2011: “enable petascale science and engineering through the deployment and support of a world-class HPC environment comprising the most capable combination of HPC assets available to the academic community”
• Scientific aim: support computationally challenging problems
—simulations that are intrinsically multi-scale, or
—simulations involving interaction of multiple processes
• Acquisition: Blue Waters @ UIUC in 2011
—goal: sustained petaflop for science and engineering codes
—$208M award from NSF
—system characteristics: powerful cores, high-performance interconnect, large high-performance memory subsystem, high-performance I/O system, high reliability
—vendor: IBM
—based on the 4-way SMT, 8-core Power7 processor
Challenges of Parallelism in the Large
• Parallel science applications are often very sophisticated
— e.g. adaptive algorithms may require dynamic load balancing
• Multilevel parallelism is difficult to manage
• Extreme scale exacerbates inefficiencies
— algorithmic scalability losses
— serialization and load imbalance
— communication or I/O bottlenecks
— insufficient or inefficient parallelization
• Hard to achieve top performance even on individual nodes
— contention for shared memory bandwidth
— memory hierarchy utilization on multicore processors
Thursday’s Class
• Introduction to parallel algorithms
—tasks and decomposition
—processes and mapping
—processes versus processors
• Decomposition techniques
—recursive decomposition
—data decomposition
—exploratory decomposition
—hybrid decomposition
• Characteristics of tasks and interactions
—task generation, granularity, and context
—characteristics of task interactions
Parallel Machines for the Course
• STIC
—170 nodes, each with two 4-core Intel Nehalem processors
—Infiniband interconnection network
—no global shared memory
• Biou
—48 nodes, each with four 8-core IBM Power7 processors
– 4-way SMT (4 hardware threads per processor core); 256GB/node
—Infiniband interconnection network
—no global shared memory
Reading Assignments
• For Thursday
—Chapter 1
—Chapter 3: 85-104
• For Tuesday
—Chapter 3: 105-142