Page 1

Course Outline

• Introduction to algorithms and applications
• Parallel machines and architectures
  – Overview of parallel machines, trends in the top-500, clusters, many-cores
• Programming methods, languages, and environments
  – Message passing (SR, MPI, Java)
  – Higher-level language: HPF
• Applications
  – N-body problems, search algorithms
• Many-core (GPU) programming (Rob van Nieuwpoort)

Page 2

Parallel Machines

Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers (2/e)
Section 1.3 + (part of) 1.4
Barry Wilkinson and Michael Allen
Pearson, 2005

Page 3

Overview

• Processor organizations
• Types of parallel machines
  – Processor arrays
  – Shared-memory multiprocessors
  – Distributed-memory multicomputers
• Cluster computers
• Blue Gene

Page 4

Processor Organization

• Network topology is a graph
  – A node is a processor
  – An edge is a communication path
• Evaluation criteria
  – Diameter (maximum distance between any two nodes)
  – Bisection width (minimum number of edges that must be removed to split the graph into two almost equal halves)
  – Number of edges per node
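As a small illustration (my own sketch, not from the slides), the diameter of any topology can be computed from its adjacency list with breadth-first search; the 6-node ring used here is an arbitrary example:

    from collections import deque

    def bfs_distances(graph, start):
        """Hop count from start to every reachable node."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        return dist

    def diameter(graph):
        """Maximum shortest-path distance over all node pairs."""
        return max(max(bfs_distances(graph, n).values()) for n in graph)

    # Example: a 6-node ring has diameter 3.
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print(diameter(ring))  # -> 3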

Page 5

Key issues in network design

• Bandwidth: number of bits transferred per second
• Latency:
  – Network latency: time for a message to travel through the network
  – Communication latency: total time to send the message, including software overhead and interface delays
  – Message latency (startup time): time to send a zero-length message
• Diameter influences latency
• Bisection width influences bisection bandwidth (the collective bandwidth over the "removed" edges)
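A common first-order model for these quantities (an assumption here; the slide does not give a formula) is communication latency = startup time + message length / bandwidth, which shows why the startup time dominates for short messages:

    def comm_latency(length_bits, startup_s, bandwidth_bps):
        """First-order model: startup time plus transfer time."""
        return startup_s + length_bits / bandwidth_bps

    # Hypothetical numbers: 10 us startup, 1 Gbit/s bandwidth.
    for length in (0, 1_000, 1_000_000):
        t = comm_latency(length, 10e-6, 1e9)
        print(f"{length:>9} bits: {t * 1e6:8.1f} us")
    # 0 bits -> 10.0 us (pure startup), 1 Mbit -> 1010.0 us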

Page 6

Mesh

• q-dimensional lattice; q = 2 gives a 2-D grid
• Number of nodes: k² (k = √p)
• Diameter: 2(k - 1)
• Bisection width: k
• Edges per node: 4

Page 7

Binary Tree

• Number of nodes: 2^k - 1 (k ≈ log₂ p)
• Diameter: 2(k - 1)
• Bisection width: 1
• Edges per node: 3

Page 8

Hypertree

• Tree with multiple roots, which gives a better bisection width
• 4-ary hypertree:
  – Number of nodes: 2^k (2^(k+1) - 1)
  – Diameter: 2k
  – Bisection width: 2^(k+1)
  – Edges per node: 6

Page 9

Engineering solution: fat tree

• Tree with more bandwidth at links near the root
• Example: CM-5

Page 10

Hypercube

• k-dimensional cube: each node has a binary label, and nodes that differ in exactly 1 bit are connected
• Number of nodes: 2^k
• Diameter: k
• Bisection width: 2^(k-1)
• Edges per node: k
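The bit-flip rule makes hypercubes easy to construct programmatically; a minimal sketch (my own illustration):

    def hypercube(k):
        """Adjacency list of a k-dimensional hypercube: node labels are
        the integers 0 .. 2**k - 1, and two nodes are connected iff
        their labels differ in exactly one bit."""
        return {node: [node ^ (1 << bit) for bit in range(k)]
                for node in range(2 ** k)}

    cube = hypercube(3)
    print(cube[0])     # -> [1, 2, 4]: each neighbor differs in one bit
    print(len(cube))   # -> 8 nodes (2**3), each with k = 3 edges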


Page 12

Comparison

                  Mesh   Tree   Hypercube
Diameter            o      +       +
Bisection width     o      -       +
#edges per node     4      3    unlimited
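To put concrete numbers behind the symbols, a sketch that evaluates the formulas from the preceding slides for machines of roughly 1024 nodes (sizes chosen arbitrarily):

    from math import isqrt, log2

    def mesh_metrics(p):        # p = k * k nodes
        k = isqrt(p)
        return 2 * (k - 1), k, 4

    def tree_metrics(p):        # p = 2**k - 1 nodes
        k = int(log2(p + 1))
        return 2 * (k - 1), 1, 3

    def hypercube_metrics(p):   # p = 2**k nodes
        k = int(log2(p))
        return k, 2 ** (k - 1), k

    print("topology    diameter  bisection  edges/node")
    for name, m in [("mesh", mesh_metrics(1024)),
                    ("tree", tree_metrics(1023)),
                    ("hypercube", hypercube_metrics(1024))]:
        print(f"{name:<11} {m[0]:>8} {m[1]:>10} {m[2]:>11}")

At this size the hypercube wins on diameter (10 vs. 62 for the mesh) and bisection width (512 vs. 32), but its node degree grows with machine size, which is what "unlimited" in the table refers to.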

Page 13

Types of parallel machines

• Processor arrays

• Shared-memory multiprocessors

• Distributed-memory multicomputers

Page 14

Processor Arrays

• Instructions operate on scalars or vectors
• Processor array = front-end + synchronized processing elements

Page 15

Processor Arrays

• Front-end
  – Sequential machine that executes the program
  – Vector operations are broadcast to the PEs
• Processing element
  – Performs the operation on its part of the vector
  – Communicates with other PEs through a network
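A conceptual sketch of this division of labor, using plain Python purely as pseudocode (a real processor array executes the broadcast operation in hardware lockstep):

    def processing_element(op, my_slice):
        # Each PE applies the broadcast operation to its own data.
        return [op(x) for x in my_slice]

    def front_end(op, vector, num_pes):
        # The front-end splits the vector and broadcasts one operation;
        # all PEs conceptually execute it at the same time.
        chunk = len(vector) // num_pes
        slices = [vector[i * chunk:(i + 1) * chunk] for i in range(num_pes)]
        results = [processing_element(op, s) for s in slices]
        return [x for part in results for x in part]

    print(front_end(lambda x: 2 * x, list(range(8)), num_pes=4))
    # -> [0, 2, 4, 6, 8, 10, 12, 14]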

Page 16

Examples of Processor Arrays

• CM-200, MasPar MP-1 and MP-2, ICL DAP (~1970s)
• Earth Simulator (Japan, 2002, former #1 of the top-500)
• The ideas are now applied in GPUs and CPU extensions like MMX

Page 17

Shared-Memory Multiprocessors

• The bus easily gets saturated => add caches to the CPUs
• Central problem: cache coherency
  – Snooping cache: monitor the bus, invalidate copies on a write
  – Write-through or copy-back
• Bus-based multiprocessors do not scale
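A toy model of write-invalidate snooping (my own sketch; real protocols such as MESI track more states): every cache watches the bus, and a write by one CPU invalidates the copies held by the others:

    class SnoopingCache:
        def __init__(self, bus):
            self.copies = {}       # address -> cached value
            bus.append(self)       # attach this cache to the shared bus

        def write(self, bus, addr, value):
            for cache in bus:      # every cache snoops the bus traffic
                if cache is not self:
                    cache.copies.pop(addr, None)  # invalidate on write
            self.copies[addr] = value

    bus = []
    c1, c2 = SnoopingCache(bus), SnoopingCache(bus)
    c1.copies[0x10] = 5
    c2.copies[0x10] = 5            # both caches hold a copy
    c1.write(bus, 0x10, 7)         # c1 writes; c2's copy is invalidated
    print(0x10 in c2.copies)       # -> False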

Page 18

Other Multiprocessor Designs (1/2)

• Switch-based multiprocessors (e.g., crossbar)

• Expensive (requires many very fast components)

Page 19

Other Multiprocessor Designs (2/2)

• Non-Uniform Memory Access (NUMA) multiprocessors
• Memory is distributed
• Some memory is faster to access than other memory
• Example: Teras at SARA, the Dutch national supercomputer (a 1024-node SGI)
• The ideas are now applied in multi-cores

Page 20

Distributed-Memory Multicomputers

• Each processor has only a local memory
• Processors communicate by sending messages over a network
• Routing of messages:
  – Packet-switched message routing: split the message into packets, which are buffered at intermediate nodes (store-and-forward)
  – Circuit-switched message routing: establish a path between source and destination

Page 21

Packet-switched Message Routing

• Messages are forwarded one node at a time
• Forwarding is done in software
• Every processor on the path from source to destination is involved
• Latency is proportional to distance × message length
• Old examples: Parsytec GCel (T800 transputers), Intel iPSC

Page 22

Circuit-switched Message Routing

• Each node has a routing module
• A circuit is set up between source and destination
• Latency is proportional to distance + message length
• Example: Intel iPSC/2

Page 23

Modern routing techniques

• Circuit switching: needs to reserve all links in the path (cf. the old telephone system)
• Packet switching: high latency and needs buffering space (cf. postal mail)
• Cut-through routing: packet switching, but packets are forwarded immediately (without buffering) if the outgoing link is available
• Wormhole routing: transmit the head (a few bits) of the message; the rest follows like a worm
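The latency claims of the last three slides can be captured in a first-order model (the constants below are hypothetical): store-and-forward pays the full transfer time at every hop, while circuit-switched, cut-through, and wormhole routing pay it essentially once:

    def store_and_forward(hops, length_bits, t_hop_s, bandwidth_bps):
        # The whole message is received and re-sent at each node:
        # latency grows with distance * message length.
        return hops * (t_hop_s + length_bits / bandwidth_bps)

    def cut_through(hops, length_bits, t_hop_s, bandwidth_bps):
        # The header reserves the path and the body streams through:
        # latency grows with distance + message length.
        return hops * t_hop_s + length_bits / bandwidth_bps

    args = (10, 1_000_000, 1e-6, 1e9)  # 10 hops, 1 Mbit, 1 us/hop, 1 Gbit/s
    print(f"store-and-forward: {store_and_forward(*args) * 1e3:.2f} ms")  # ~10.01
    print(f"cut-through:       {cut_through(*args) * 1e3:.2f} ms")        # ~1.01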

Page 24

Performance

[Figure: network latency as a function of distance (number of hops) for wormhole routing, circuit switching, and packet switching.]

Page 25

Distributed Shared Memory

• Shared memory is easier to program, but doesn't scale
• Distributed memory is hard to program, but does scale
• Distributed Shared Memory (DSM): provide a shared-memory programming model on top of distributed-memory hardware
  – Shared Virtual Memory (SVM): use the memory-management hardware (paging) and copy pages over the network
  – Object-based: provide replicated shared objects (the Orca language)
• Was a hot research topic in the 1990s, but performance remained the bottleneck

Page 26

Flynn's Taxonomy

• Instruction stream: sequence of instructions
• Data stream: sequence of data manipulated by the instructions

                       Single Data   Multiple Data
Single Instruction         SISD          SIMD
Multiple Instruction       MISD          MIMD

Page 27

Flynn's Taxonomy

• SISD (Single Instruction, Single Data): traditional uniprocessors
• SIMD (Single Instruction, Multiple Data): processor arrays
• MISD (Multiple Instruction, Single Data): nonexistent?
• MIMD (Multiple Instruction, Multiple Data): multiprocessors and multicomputers