Multiprocessors—Large vs. Small Scale



Page 1: Multiprocessors—Large vs. Small Scale

Multiprocessors—Large vs. Small Scale

Page 2: Multiprocessors—Large vs. Small Scale

Small-Scale MIMD Designs

Memory: centralized, with uniform memory access time (UMA) and a bus interconnect

Examples: SPARCcenter

Page 3: Multiprocessors—Large vs. Small Scale

Large-Scale MIMD Designs

Memory: distributed, with non-uniform memory access time (NUMA) and a scalable interconnect

Examples: Cray T3D, Intel Paragon, CM-5

Page 4: Multiprocessors—Large vs. Small Scale

Communication Models

Shared memory
– Communication via a shared address space
– Advantages:
  Ease of programming
  Lower latency
  Easier to use hardware-controlled caching

Message passing
– Processors have private memories, communicate via messages
– Advantages:
  Less hardware, easier to design
  Focuses attention on costly non-local operations
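The contrast between the two models can be sketched in plain Python (a toy illustration, not any real machine's interface): the first pair of workers updates one counter in a shared address space, while the second pair keeps private state and communicates only through explicit messages.

```python
import threading
import queue

# Shared memory: both workers write the same variable in one address space;
# a lock plays the role the hardware coherence/synchronization support plays.
counter = 0
lock = threading.Lock()

def shared_worker(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 2000

# Message passing: each worker computes on private data and sends one message.
inbox = queue.Queue()

def mp_worker(n):
    local_sum = sum(range(n))  # private memory, never shared
    inbox.put(local_sum)       # the explicit, costly non-local operation

workers = [threading.Thread(target=mp_worker, args=(1000,)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
total = inbox.get() + inbox.get()
print(total)  # 999000
```

Note how the shared-memory version is shorter to write but needs synchronization on every access, while the message-passing version makes the single communication point explicit.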

Page 5: Multiprocessors—Large vs. Small Scale

Communication Properties

Bandwidth
– Need high bandwidth in communication
– Limits in network, memory, and processor

Latency
– Affects performance, since the processor must wait
– Affects ease of programming: how to overlap communication and computation

Latency hiding
– How can a mechanism help hide latency?
– Examples: overlap a message send with computation, prefetch
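Latency hiding can be sketched as follows (a toy example: `remote_fetch` and its sleep merely stand in for a slow communication operation): the slow operation is issued early, and independent computation overlaps its latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_fetch(x):
    time.sleep(0.2)   # stands in for communication latency
    return x * 2

with ThreadPoolExecutor() as pool:
    future = pool.submit(remote_fetch, 21)      # issue early, like a prefetch
    local = sum(i * i for i in range(100_000))  # independent work overlaps the wait
    result = future.result()                    # block only for whatever latency remains
print(result)  # 42
```

If the computation takes about as long as the fetch, the latency is almost entirely hidden; issued sequentially, the two costs would add.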

Page 6: Multiprocessors—Large vs. Small Scale

Small-Scale—Shared Memory

Caches serve to:
– Increase bandwidth versus bus/memory
– Reduce latency of access
– Valuable for both private data and shared data

What about cache consistency?

Page 7: Multiprocessors—Large vs. Small Scale

The Problem of Cache Coherency

Value of X in memory is 1
CPU A reads X – its cache now contains 1
CPU B reads X – its cache now contains 1
CPU A stores 0 into X
– CPU A's cache contains a 0
– CPU B's cache contains a 1
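The sequence above can be reproduced with a toy model of two private caches over one memory with no invalidation protocol (a sketch for illustration only, not a real coherence mechanism):

```python
memory = {"X": 1}

class Cache:
    def __init__(self, mem):
        self.mem = mem
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:       # miss: fill the line from memory
            self.lines[addr] = self.mem[addr]
        return self.lines[addr]          # hit: served from the private cache

    def write(self, addr, value):
        self.lines[addr] = value         # update own line...
        self.mem[addr] = value           # ...and memory, but never the peer cache

cache_a, cache_b = Cache(memory), Cache(memory)
cache_a.read("X")        # CPU A reads X -> caches 1
cache_b.read("X")        # CPU B reads X -> caches 1
cache_a.write("X", 0)    # CPU A stores 0 into X
print(cache_a.read("X"), cache_b.read("X"))  # 0 1 -- B still sees the stale value
```

A coherence protocol (e.g. snooping on the bus) would fix this by invalidating or updating B's copy on A's write.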

Page 8: Multiprocessors—Large vs. Small Scale

Multicore Systems

Page 9: Multiprocessors—Large vs. Small Scale

Multicore Computers (chip multiprocessors)

Combine two or more processors (cores) on a single piece of silicon

Each core has its own ALU, registers, pipeline hardware, and L1 instruction and data caches

Multithreading is used

Page 10: Multiprocessors—Large vs. Small Scale

Pollack’s Rule

Performance increase is roughly proportional to the square root of the increase in complexity:

performance ∝ √complexity

Power consumption increase is roughly linearly proportional to the increase in complexity:

power ∝ complexity

Page 11: Multiprocessors—Large vs. Small Scale

Pollack’s Rule

complexity   power   performance
     1         1          1
     4         4          2
    25        25          5

100s of low-complexity cores, each operating at very low power

Ex: Four small cores

complexity   power   performance
   4x1        4x1         4
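The table follows directly from the rule; a quick check in Python (assuming, for the four small cores, that the work parallelizes perfectly):

```python
import math

def performance(complexity):
    return math.sqrt(complexity)   # performance ~ sqrt(complexity)

def power(complexity):
    return complexity              # power ~ complexity

# One big core of complexity 25 vs. four small cores of complexity 1 each.
big = (performance(25), power(25))          # (5.0, 25)
small = (4 * performance(1), 4 * power(1))  # (4.0, 4) with perfectly parallel work
print(big, small)
```

The small-core design delivers 80% of the big core's performance for about a sixth of the power, which is the argument for many low-complexity cores.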

Page 12: Multiprocessors—Large vs. Small Scale

Increasing CPU Performance

Manycore chip composed of hybrid cores:

• Some general purpose

• Some graphics

• Some floating point

Page 13: Multiprocessors—Large vs. Small Scale

Exascale Systems

Board composed of multiple manycore chips sharing memory

Rack composed of multiple boards

A room full of these racks

Millions of cores → Exascale systems (10^18 Flop/s)

Page 14: Multiprocessors—Large vs. Small Scale

Moore’s Law Reinterpreted

Number of cores per chip doubles every 2 years

Number of threads of execution doubles every 2 years
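Under this reinterpretation, core counts compound the way clock rates once did; a small projection helper (the numbers are hypothetical, for illustration):

```python
def cores_after(base_cores, years, doubling_period=2):
    # cores double once per doubling_period years
    return base_cores * 2 ** (years // doubling_period)

print(cores_after(4, 10))  # a 4-core chip grows to 128 cores in 10 years
```

The same arithmetic applies to hardware thread counts, since they double on the same schedule.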

Page 15: Multiprocessors—Large vs. Small Scale

Shared Memory MIMD

Shared memory

• Single address space

• All processes have access to the pool of shared memory

[Diagram: four processors (P) connected by a bus to a single shared memory]

Page 16: Multiprocessors—Large vs. Small Scale

Shared Memory MIMD

Each processor executes different instructions asynchronously, using different data

[Diagram: each control unit (CU) feeds its own instruction stream to a processing element (PE); each PE operates on its own data in the shared memory]

Page 17: Multiprocessors—Large vs. Small Scale

Symmetric Multiprocessors (SMP)

MIMD, shared memory, UMA

[Diagram: two processors, each with private L1 and L2 caches, connected by a system bus to main memory and I/O modules]

Page 18: Multiprocessors—Large vs. Small Scale

Symmetric Multiprocessors (SMP)

Characteristics:

Two or more similar processors

Processors share the same memory and I/O facilities

Processors are connected by bus or other internal connection scheme, such that memory access time is the same for each processor

All processors share access to I/O devices

All processors can perform the same functions

The system is controlled by the operating system

Page 19: Multiprocessors—Large vs. Small Scale

Symmetric Multiprocessors (SMP)

Operating system:

Provides tools and functions to exploit the parallelism

Schedules processes or threads across all of the processors

Takes care of

• scheduling of threads and processes on processors

• synchronization among processors

Page 20: Multiprocessors—Large vs. Small Scale

Multicore Computers

Dedicated L1 Cache

(ARM11 MPCore)

[Diagram: CPU cores 1…n, each with dedicated L1-I and L1-D caches, connected to an L2 cache, main memory, and I/O]

Page 21: Multiprocessors—Large vs. Small Scale

Multicore Computers

Dedicated L2 Cache

(AMD Opteron)

[Diagram: CPU cores 1…n, each with dedicated L1-I/L1-D caches and its own L2 cache, connected to main memory and I/O]

Page 22: Multiprocessors—Large vs. Small Scale

Multicore Computers

Shared L2 Cache

(Intel Core Duo)

[Diagram: CPU cores 1…n, each with dedicated L1-I/L1-D caches, all sharing a single L2 cache connected to main memory and I/O]

Page 23: Multiprocessors—Large vs. Small Scale

Multicore Computers

Shared L3 Cache

(Intel Core i7)

[Diagram: CPU cores 1…n, each with dedicated L1-I/L1-D and L2 caches, all sharing an L3 cache connected to main memory and I/O]

Page 24: Multiprocessors—Large vs. Small Scale

Multicore Computers

Advantages of shared L2 cache:
• Reduced overall miss rate
  – A thread on one core may cause a frame to be brought into the cache; a thread on another core may then access the same location, which is already cached
• Data shared by multiple cores is not replicated
• The amount of shared cache allocated to each core may be dynamic
• Interprocessor communication is easy to implement

Advantage of dedicated L2 cache:
• Each core can access its private cache more rapidly

L3 cache:
• As the amount of memory and the number of cores grow, an L3 cache provides better performance

Page 25: Multiprocessors—Large vs. Small Scale

Multicore Computers

On-chip interconnects: bus, crossbar

Off-chip communication (CPU-to-CPU or I/O): bus-based

Page 26: Multiprocessors—Large vs. Small Scale

Multicore Computers (chip multiprocessors)

Combine two or more processors (cores) on a single piece of silicon

Each core has its own ALU, registers, pipeline hardware, and L1 instruction and data caches

Multithreading is used

Page 27: Multiprocessors—Large vs. Small Scale

Multicore Computers

Multithreading

A multithreaded processor provides a separate PC for each thread (hardware multithreading)

Implicit multithreading
• Concurrent execution of multiple threads extracted from a single sequential program

Explicit multithreading
• Execute instructions from different explicit threads, by interleaving instructions from different threads on shared or parallel pipelines

Page 28: Multiprocessors—Large vs. Small Scale

Multicore Computers: Explicit Multithreading

Fine-grained multithreading (interleaved multithreading)
• Processor deals with two or more thread contexts at a time
• Switches from one thread to another at each clock cycle

Coarse-grained multithreading (blocked multithreading)
• Instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs
• This event causes a switch to another thread

Simultaneous multithreading (SMT)
• Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
• Thread-level parallelism is combined with instruction-level parallelism (ILP)

Chip multiprocessing (CMP)
• Each processor of a multicore system handles separate threads
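The first two policies can be contrasted with toy schedulers over two hypothetical instruction streams (a sketch only; real hardware switches thread contexts, not Python lists, and stalls come from actual cache misses):

```python
from itertools import zip_longest

t0 = ["A1", "A2", "A3"]
t1 = ["B1", "B2", "B3"]

def fine_grained(a, b):
    # switch to the other thread every clock cycle
    out = []
    for x, y in zip_longest(a, b):
        if x:
            out.append(x)
        if y:
            out.append(y)
    return out

def coarse_grained(a, b, stall_after=2):
    # run one thread until a stall (modeled as every `stall_after` instructions),
    # then switch to the other thread
    out = []
    streams = [list(a), list(b)]
    cur = 0
    while any(streams):
        issued = 0
        while streams[cur] and issued < stall_after:
            out.append(streams[cur].pop(0))
            issued += 1
        cur ^= 1
    return out

print(fine_grained(t0, t1))    # ['A1', 'B1', 'A2', 'B2', 'A3', 'B3']
print(coarse_grained(t0, t1))  # ['A1', 'A2', 'B1', 'B2', 'A3', 'B3']
```

SMT differs from both in that a superscalar core issues instructions from several threads in the *same* cycle, rather than alternating between them.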

Page 29: Multiprocessors—Large vs. Small Scale

Coarse-grained, fine-grained, simultaneous multithreading (SMT), CMP

Page 30: Multiprocessors—Large vs. Small Scale

GPUs (Graphics Processing Units)

Characteristics of GPUs

GPUs are accelerators for CPUs

SIMD

GPUs have many parallel processors and many concurrent threads (e.g. 10 or more cores; 100s or 1000s of threads per core)

The CPU-GPU combination is an example of heterogeneous computing

GPGPU (general purpose GPU): using a GPU to perform applications traditionally handled by the CPU
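The SIMD style that a GPGPU accelerates can be sketched in plain Python as one operation applied across many data elements (SAXPY, the classic example; on a real GPU the per-element work would run in thousands of hardware threads rather than a loop):

```python
def saxpy(a, x, y):
    # a single instruction (a * x + y) applied element-wise across all the data
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1, 2, 3], [10, 20, 30]))  # [12.0, 24.0, 36.0]
```

Because every element is independent, the computation maps directly onto SIMD lanes or GPU threads with no synchronization between them.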

Page 31: Multiprocessors—Large vs. Small Scale

GPUs