COMP 422 Parallel Computing: An Introduction
John Mellor-Crummey
Department of Computer Science
Rice University
[email protected]
COMP 422, Lecture 1, 11 January 2011
Course Information
• Meeting time: TTh 1:00-2:15
• Meeting place: DH 1075
• Instructor: John Mellor-Crummey
—Email: [email protected]
—Office: DH 3082, x5179
—Office Hours: Wednesday 8:30am-9:30am or by appointment
• TA: Xu Liu
—Email: [email protected]
—Office: DH 3063, x5710
—Office Hours: TBA
• WWW site: http://www.clear.rice.edu/comp422
Parallelism
• Definition: ability to execute parts of a program concurrently
• Goal: shorter running time
• Grain of parallelism: how big are the parts?
—bit, instruction, statement, loop iteration, procedure, …
• COMP 422 focus: explicit thread-level parallelism
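A minimal sketch of explicit thread-level parallelism, assuming C with OpenMP (one of the programming models covered later in the course; compile with a flag such as gcc -fopenmp; array and variable names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Loop-level parallelism: each thread executes a disjoint
           subset of the iterations concurrently. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f (threads available: %d)\n",
               c[N-1], omp_get_max_threads());
        return 0;
    }

Each loop iteration is independent, so the runtime can divide the iterations among threads without changing the result; the goal is a shorter running time.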
Course Objectives
• Learn fundamentals of parallel computing
—principles of parallel algorithm design
—programming models and methods
—parallel computer architectures
—parallel algorithms
—modeling and analysis of parallel programs and systems
• Develop skill writing parallel programs
—programming assignments
• Develop skill analyzing parallel computing problems
—solving problems posed in class
Text
Introduction to Parallel Computing, 2nd Edition
Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
Addison-Wesley, 2003
Required
Topics (Part 1)
• Introduction (Chapter 1)
• Principles of parallel algorithm design (Chapter 3)
—decomposition techniques
—mapping & scheduling computation
—templates
• Programming shared-address space systems (Chapter 7)
—Cilk/Cilk++
—OpenMP
—Pthreads
—synchronization
• Parallel computer architectures (Chapter 2)
—shared memory systems and cache coherence
—distributed-memory systems
—interconnection networks and routing
Topics (Part 2)
• Programming scalable systems (Chapter 6)
—message passing: MPI
—global address space languages: UPC, CAF
• Collective communication
• Analytical modeling of program performance (Chapter 5)
—speedup, efficiency, scalability, cost optimality, isoefficiency
• Parallel algorithms (Chapters 8 & 10)
—non-numerical algorithms: sorting, graphs, dynamic prog.
—numerical algorithms: dense and sparse matrix algorithms
• Performance measurement and analysis of parallel programs
• GPU Programming with CUDA and OpenCL
• Problem solving on clusters using MapReduce
Prerequisites
• Programming in C and/or Fortran
• Basics of data structures
• Basics of machine architecture
• Required courses: Comp 221 or equivalent
• See me if you have concerns!
Motivating Parallel Computing
• Technology push
• Application pull
The Rise of Multicore Processors
Advance of Semiconductors: “Moore’s Law”
Gordon Moore, co-founder of Intel
• 1965: since the integrated circuit was invented, the number of
transistors/inch² in these circuits roughly doubled every year;
this trend would continue for the foreseeable future
• 1975: revised - circuit complexity doubles every 18 months
Image credit: http://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Leveraging Moore’s Law Trends
From increasing circuit density to performance
• More transistors = ↑ opportunities for exploiting parallelism
• Microprocessors
—implicit parallelism: not visible to the programmer
– pipelined execution of instructions
– multiple functional units for multiple independent pipelines
—explicit parallelism
– streaming and multimedia processor extensions
operations on 1, 2, and 4 data items per instruction
integer, floating point, complex data
MMX, Altivec, SSE (a minimal SSE sketch follows this list)
– long instruction words
bundles of independent instructions that can be issued together
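For the SIMD extensions above, a minimal sketch using SSE intrinsics from the standard <xmmintrin.h> header (assumes an x86 compiler with SSE enabled; function and array names are illustrative):

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers, 4 floats per op */

    /* Add two float arrays four elements at a time using SSE.
       Assumes n is a multiple of 4; a remainder loop would handle the rest. */
    void vadd_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 additions */
            _mm_storeu_ps(&c[i], vc);
        }
    }

Each _mm_add_ps performs four single-precision additions in a single instruction, which is the "operations on multiple data items per instruction" form of explicit parallelism noted above.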
Microprocessor Architecture (Mid 90’s)
• Superscalar (SS) designs were the state of the art
—multiple instruction issue
—dynamic scheduling: h/w tracks dependencies between instr.
—speculative execution: look past predicted branches
—non-blocking caches: multiple outstanding memory ops
• Apparent path to higher performance?
—wider instruction issue
—support for more speculation
The End of the Free Lunch
Increasing issue width provides diminishing returns
Two factors¹
• Fundamental circuit limitations
—delays increase as issue queues and multi-port register files grow
—increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism
—inefficient for codes with difficult-to-predict branches
¹The Case for a Single-Chip Multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
Issue Waste
Instruction-level Parallelism Concerns
• Contributing factors
—instruction dependencies
—long-latency operations within a thread
Sources of Wasted Issue Slots
• TLB miss
• I cache miss
• D cache miss
• Load delays (L1 hits)
• Control hazard
• Branch misprediction
• Instruction dependences
• Memory conflict
Figure categories: Memory Hierarchy, Control Flow, Instruction Stream
Control hazard: not knowing which instruction to execute after a branch (will the branch be taken or not)
How much IPC? Simulate 8-issue SS
Simultaneous multithreading: maximizing on-chip parallelism, Tullsen et al., ISCA 1995.
Applications: most of SPEC92
• On average < 1.5 IPC (19% of issue slots used)
• Short FP dependences: 37%
• 6 other causes of waste, each ≥ 4.5%
• Dominant waste differs by application
Power and Heat Stall Clock Frequencies
May 17, 2004 … Intel, the world's largest chip maker, publicly acknowledged that it had hit a "thermal wall" on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects …
Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor … Intel's decision to change course and embrace a "dual core" processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers … some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures …
New York Times, May 17, 2004
Recent Multicore Processors
• March 2010: Intel Westmere
— 6 cores; 2-way SMT
• March 2010: AMD Magny-Cours
— two 6-core dies in a single package
• January 2010: IBM Power7
— 8 cores; 4-way SMT; 32MB shared cache
• June 2009: AMD Istanbul
— 6 cores
• Sun T2
— 8 cores; 8-way fine-grain MT per core
• IBM Cell
— 1 PPC core; 8 SPEs w/ SIMD parallelism
• Tilera:
— 64 cores in an 8x8 mesh; 5MB cache on chip
August 2009: Nehalem EX
Figure credit: http://hothardware.com/News/Intel-Unveils-NehalemEX-OctalCore-Server-CPU
Jan 2010: IBM Power7
• Slide source: http://common.ziffdavisinternet.com/util_get_image/21/0,1425,sz=1&i=211884,00.jpg
• Slide attributed to IBM
New Applications for the Desktop
Intel’s Recognition, Mining, Synthesis Research Initiative
Compute-Intensive, Highly Parallel Applications and Uses
Intel Technology Journal, 9(2), May 19, 2005. DOI: 10.1535/itj.0902
http://www.intel.com/technology/itj/2005/volume09issue02/
graphics, computer vision, data mining, large-scale optimization
Application Pull
• Complex problems require computation on large-scale data
• Sufficient performance available only through massive parallelism
Computing and Science
• “Computational modeling and simulation are among the most significant developments in the practice of scientific inquiry in the 20th century. Within the last two decades, scientific computing has become an important contributor to all scientific disciplines.
• It is particularly important for the solution of research problems that are insoluble by traditional scientific theoretical and experimental approaches, hazardous to study in the laboratory, or time consuming or expensive to solve by traditional means”
— “Scientific Discovery through Advanced Computing,” DOE Office of Science, 2000
The Need for Speed: Complex Problems
• Science
—understanding matter from elementary particles to cosmology
—storm forecasting and climate prediction
—understanding biochemical processes of living organisms
• Engineering
—combustion and engine design
—computational fluid dynamics and airplane design
—earthquake and structural modeling
—pollution modeling and remediation planning
—molecular nanotechnology
• Business
—computational finance
—information retrieval
—data mining
• Defense
—nuclear weapons stewardship
—cryptology
Earthquake Simulation
Earthquake Research Institute, University of Tokyo
Tonankai-Tokai Earthquake Scenario
Photo Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
Ocean Circulation Simulation
Ocean Global Circulation Model for the Earth Simulator
Seasonal Variation of Ocean Temperature
Photo Credit: The Earth Simulator Art Gallery, CD-ROM, March 2004
Community Earth System Model (CESM)
Figure courtesy of M. Vertenstein (NCAR)
CESM Execution Configurations
Figure courtesy of M. Vertenstein (NCAR)
CESM Simulations on a Cray XT
Figure courtesy of Pat Worley (ORNL)
Simulating Turbulent Reacting Flows: S3D
• Direct numerical simulation (DNS) of turbulent combustion
—state-of-the-art code developed at CRF/Sandia
– PI: Jacqueline H. Chen, SNL
—2010: Cray XT (65M) and Blue Gene/P (2M) CPU hours
—“High-fidelity simulations for clean and efficient combustion of alternative fuels”
• Science
—study micro-physics of turbulent reacting flows
– physical insight into chemistry-turbulence interactions
—simulate detailed chemistry; multi-physics (sprays, radiation, soot)
—develop and validate reduced model descriptions used in macro-scale simulations of engineering-level systems
[Figure: DNS → physical models → engineering CFD codes (RANS, LES)]
Text and figures courtesy of Jacqueline H. Chen, SNL
Fluid-Structure Interactions
• Simulate …
—rotational geometries (e.g. engines, pumps), flapping wings
• Traditionally, such simulations have used a fixed mesh
—drawback: solution quality is only as good as initial mesh
• Dynamic mesh computational fluid dynamics
—integrate automatic mesh generation within parallel flow solver
– nodes added in response to user-specified refinement criteria
– nodes deleted when no longer needed
– element connectivity changes to maintain minimum energy mesh
—mesh changes continuously as geometry + solution changes
• Example: 3D simulation of a hummingbird’s flight
[Andrew Johnson, AHPCRC 2005]
Air Velocity (Front)
Air Velocity (Side)
Mesh Adaptation (front)
Mesh Adaptation (side)
Challenges of Explicit Parallelism
• Algorithm development is harder
—complexity of specifying and coordinating concurrent activities
• Software development is much harder
—lack of standardized & effective development tools and programming models
—subtle program errors: race conditions (a minimal example follows this list)
• Rapid pace of change in computer system architecture
—today’s hot parallel algorithm may not be a good match for tomorrow’s parallel hardware!
– example: homogeneous multicore processors vs. GPGPU
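A minimal sketch of the race-condition hazard mentioned above, assuming C with OpenMP (variable names and the loop bound are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long long sum = 0;

        /* RACE: all threads read-modify-write `sum` without synchronization,
           so updates can be lost and the result is nondeterministic. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            sum += i;                          /* data race */
        printf("racy sum      = %lld\n", sum);

        sum = 0;
        /* One fix: let OpenMP combine per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000000; i++)
            sum += i;
        printf("reduction sum = %lld\n", sum); /* 499999500000 */
        return 0;
    }

The first loop is nondeterministic because concurrent updates to sum can be lost; the reduction clause is one standard repair, and errors of this kind are exactly the subtle bugs that make parallel software development harder.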
Hummingbird Simulation in UPC
• UPC: PGAS language for scalable parallel systems
• Application overview
—distribute mesh among the processors
—partition the mesh using recursive bisection
—each PE maintains and controls its piece of the mesh
– has a list of nodes, faces, and elements
—communication and synchronization
– read from or write to other PEs’ “entities” as required
– processors frequently synchronize using barriers
– use “broadcast” and “reduction” patterns
—constraint
– only 1 processor may change the mesh at a time
Algorithm Sketch
At each time step…
• Test if re-partitioning is required
• Set up interprocessor communication if the mesh changed
• Block (color) elements into vectorizable groups
• Calculate the refinement value at each mesh node
• Move the mesh
• Solve the coupled fluid-flow equation system
• Update the mesh to ensure mesh quality
—swap element faces to obtain a “Delaunay” mesh
—add nodes to locations where there are not enough
—delete nodes from locations where there are too many
—swap element faces to obtain a “Delaunay” mesh
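A schematic C rendering of this per-time-step structure; every type and function name below is a hypothetical placeholder for a phase listed above, not the application’s actual API:

    typedef struct Mesh Mesh;              /* opaque placeholder types */
    typedef struct Solution Solution;

    /* Hypothetical phase routines, one per step on this slide. */
    int  repartition_required(const Mesh *m);
    void setup_communication(Mesh *m);
    void color_elements(Mesh *m);
    void compute_refinement_values(Mesh *m, const Solution *s);
    void move_mesh(Mesh *m, Solution *s);
    void solve_flow_system(Mesh *m, Solution *s);
    void swap_faces_to_delaunay(Mesh *m);
    void add_nodes_where_refined(Mesh *m, const Solution *s);
    void delete_unneeded_nodes(Mesh *m, const Solution *s);

    void time_step(Mesh *mesh, Solution *soln) {
        if (repartition_required(mesh))         /* test if re-partitioning is required   */
            setup_communication(mesh);          /* rebuild inter-processor communication */

        color_elements(mesh);                   /* block elements into vectorizable groups */
        compute_refinement_values(mesh, soln);  /* refinement value at each mesh node      */
        move_mesh(mesh, soln);                  /* move the mesh                           */
        solve_flow_system(mesh, soln);          /* coupled fluid-flow equation system      */

        /* maintain mesh quality (only one processor changes the mesh at a time) */
        swap_faces_to_delaunay(mesh);
        add_nodes_where_refined(mesh, soln);
        delete_unneeded_nodes(mesh, soln);
        swap_faces_to_delaunay(mesh);
    }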
Parallel Hardware in the Large
Hierarchical Parallelism in Supercomputers
• Cores with pipelining and short vectors
• Multicore processors
• Shared-memory multiprocessor nodes
• Scalable parallel systems
Image credit: http://www.nersc.gov/news/reports/bluegene.gif
Achieving High Performance on Parallel Systems
• Memory latency and bandwidth
— CPU rates have improved 4x as fast as memory over the last decade
— bridge the speed gap using the memory hierarchy (see the loop-order sketch below)
— multicore exacerbates demand
• Interprocessor communication
Computation is only part of the picture
• Input/output
— I/O bandwidth to disk typically grows linearly with # processors
Image Credit: Bob Colwell, ISCA 1995
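A small sketch of why memory-hierarchy behavior dominates performance: the same summation with unit-stride versus strided access, in plain C (the array size and timing method are illustrative):

    #include <stdio.h>
    #include <time.h>

    #define N 4096
    static double a[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;

        clock_t t0 = clock();
        double s1 = 0.0;
        for (int i = 0; i < N; i++)        /* row-major traversal: unit stride, */
            for (int j = 0; j < N; j++)    /* cache lines fully reused          */
                s1 += a[i][j];
        double t_row = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        double s2 = 0.0;
        for (int j = 0; j < N; j++)        /* column-major traversal: stride of */
            for (int i = 0; i < N; i++)    /* N doubles, poor cache reuse       */
                s2 += a[i][j];
        double t_col = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("row-major %.3fs, column-major %.3fs (sums %.0f %.0f)\n",
               t_row, t_col, s1, s2);
        return 0;
    }

Both loops do identical arithmetic; the strided version is typically several times slower because it wastes the bandwidth and latency-hiding that the memory hierarchy provides.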
Historical Concurrency in Top 500 Systems
Image credit: http://www.top500.org/overtime/list/36/procclass
Scale of Largest HPC Systems (Nov. 2010)
Image credit: http://www.top500.org/list/2010/11/100
Figure annotations: >100K cores; >10¹⁵ FLOP/s; almost 300K cores; hybrid CPU+GPU
Forthcoming NSF “Blue Waters” System
• NSF’s goal for 2006-2011: “enable petascale science and engineering through the deployment and support of a world-class HPC environment comprising the most capable combination of HPC assets available to the academic community”
• Scientific aim: support computationally challenging problems
—simulations that are intrinsically multi-scale, or
—simulations involving interaction of multiple processes
• Acquisition: Blue Waters @ UIUC in 2011
—goal: sustained petaflop for science and engineering codes
—$208M award from NSF
—system characteristics: powerful cores, high-performance interconnect, large high-performance memory subsystem, high-performance I/O system, high reliability
—vendor: IBM
—based on the 4-way SMT, 8-core Power7 processor
Challenges of Parallelism in the Large
• Parallel science applications are often very sophisticated
— e.g. adaptive algorithms may require dynamic load balancing
• Multilevel parallelism is difficult to manage
• Extreme scale exacerbates inefficiencies
— algorithmic scalability losses
— serialization and load imbalance
— communication or I/O bottlenecks
— insufficient or inefficient parallelization
• Hard to achieve top performance even on individual nodes
— contention for shared memory bandwidth
— memory hierarchy utilization on multicore processors
Thursday’s Class
• Introduction to parallel algorithms
—tasks and decomposition
—processes and mapping
—processes versus processors
• Decomposition techniques
—recursive decomposition
—data decomposition
—exploratory decomposition
—hybrid decomposition
• Characteristics of tasks and interactions
—task generation, granularity, and context
—characteristics of task interactions
Parallel Machines for the Course
• STIC
—170 nodes, each with two 4-core Intel Nehalem processors
—Infiniband interconnection network
—no global shared memory
• Biou
—48 nodes, each with four 8-core IBM Power7 processors
– 4-way SMT (4 hardware threads per processor core); 256GB/node
—Infiniband interconnection network
—no global shared memory
Reading Assignments
• For Thursday
—Chapter 1
—Chapter 3: 85-104
• For Tuesday
—Chapter 3: 105-142