ECE5917 SoC Architecture: MP SoC – Part 1
Tae Hee Han: [email protected]
Semiconductor Systems Engineering
Sungkyunkwan University
Outline
- Overview
- Parallelism
  - Data-Level Parallelism
  - Instruction-Level Parallelism
  - Thread-Level Parallelism
  - Processor-Level Parallelism
- Multi-core
Overview
Where Are We Headed?
[Figure: MIPS vs. year, 1970-2010, log scale. The single-chip CPU era (to ~2004) runs from the 8086/286/386/486 through pipelining, superscalar, and speculative out-of-order execution (the era of instruction-level parallelism), followed by SIMD extensions. From ~2004 onward: multithreading, multi-core, special-purpose HW, and CPU-GPU fusion (the era of thread- and processor-level parallelism).]
Note: the time frames reflect when each technique became popular, not when it first appeared.
Where Are We Headed? (Intel – AMD Architecture Transition)
[Figure: Intel and AMD product roadmaps, 2002-2012. Intel desktop/server: 130 nm Tualatin (P6/Pentium III), 130 nm Northwood/Gallatin and 90 nm Prescott/Smithfield (NetBurst), 65 nm Cedar Mill/Presler, 65 nm Merom and 45 nm Penryn (Core), 45 nm Nehalem, 32 nm Westmere, 32 nm Sandy Bridge, 22 nm Ivy Bridge. Intel mobile: 130 nm Banias and 90 nm Dothan (P6/Pentium M), 65 nm Yonah. AMD desktop/server: 180-130 nm K7, 130-65 nm K8, 65-45 nm K10 (K8L), 32 nm Bulldozer. Core counts grow from 1 to 2, 4 (including MCM parts), and 6 or more, marking the transition from the single-core era through the multi-core era to the system-level integration era; the "single-core crisis" and the "CELL shock" mark the turning points.]
Processor Architectures: Flynn's Classification
- SISD: Single Instruction, Single Data stream
  - Uniprocessor
- SIMD: Single Instruction, Multiple Data streams
  - The same instruction is executed by multiple processing units
  - e.g., multimedia processors, vector architectures
- MISD: Multiple Instruction, Single Data stream
  - Successive functional units operate on the same stream of data
  - Rarely found in general-purpose commercial designs
- MIMD: Multiple Instruction, Multiple Data streams
  - Each processor has its own instruction and data streams
  - The most popular form of parallel processing
  - Single-user: high performance for one application
  - Multiprogrammed: running many tasks simultaneously (e.g., servers)
[Figure: SISD: a single PU fed by one instruction pool and one data pool. SIMD: one instruction pool driving multiple PUs, each on its own data stream.]
System-level Integration (Chuck Moore, AMD at MICRO 2008)
- Single-chip CPU Era: 1986-2004
  - Extreme focus on single-threaded performance
  - Multi-issue, out-of-order execution plus a moderate cache hierarchy
- Chip Multiprocessor (CMP) Era: 2004-2010
  - Early: hasty integration of multiple cores into the same chip/package
  - Mid-life: addressed some of the HW scalability and (memory) interference issues
  - Current: homogeneous CPUs plus moderate system-level functionality
- System-level Integration Era: ~2010 onward
  - Integration of substantial system-level functionality
  - Heterogeneous processors and accelerators
  - Introspective control systems for managing on-chip resources and events
Challenges!: Chuck Moore (AMD, 2011)
[Figure: six trend sketches, each marked "we are here": Moore's Law (integration, log scale, vs. time; challenged by DFM, variability, reliability, and wire delay), the ILP complexity wall (IPC vs. issue width), the power wall (TDP budget vs. time; servers: power = $$, desktops: eliminate fans, mobile: battery), locality (performance vs. cache size), the frequency wall (frequency vs. time), and flattening single-thread performance.]
Three Walls to Serial Performance
- Memory Wall
- Instruction-Level Parallelism (ILP) Wall
- Power Wall
Source: the excellent article "The Many-Core Inflection Point for Mass Market Computer Systems" by John L. Manferdelli, Microsoft Corporation
http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/
Recall: Memory Wall
- The processor-memory (DRAM) performance gap!
  - DRAM: an access that took 1 cycle in 1980 takes hundreds of cycles in 2010
  - Registers: fast but small and expensive
  - We want memory that is fast, large, and cheap
Recall: Typical Memory Hierarchy
[Figure: memory hierarchy pyramid. From smaller/faster/costlier to larger/slower/cheaper: registers (flip-flops), L1 cache (SRAM), L2-L3 caches (SRAM), main memory (DRAM), flash cache/SSD, data storage (HDD), and external secondary storage (external HDD, tape, CD/DVD, cloud server). A performance gap of roughly 10^5 separates main memory from disk, bridged by the flash cache/SSD.]
Random-access (read) latency:

Type           Access time        Capacity                               Managed by
Register       1 cycle            ~500-1,000 B                           Compiler
L1 cache       ~3-4 cycles        ~64 KB                                 HW
L2 cache       ~10-30 cycles      ~256 KB                                HW
L3 cache       ~30-60 cycles      ~2-8 MB                                HW
Main memory    ~100-300 cycles    512 MB-4 GB (mobile) / 4-16 GB (PC)    OS
Flash storage  ~5K-10K cycles     8-32 GB (mobile) / 128-512 GB (PC)     OS/operator
HDD            ~10M-20M cycles    > 1 TB (PC)                            OS/operator
Recall: How to alleviate the Memory Wall Problem
- Hiding/reducing the memory access latency
  - A holistic approach: caches, local memory, DRAM stacking, HW/SW prefetching, data-locality optimization, memory controllers, SMT
- Increasing the bandwidth
  - Lower latency helps bandwidth, but not vice versa
- Reducing the number of memory accesses
  - Keep as much reusable data in caches and local memory as possible
ILP Wall
- Duplicate hardware speculatively executes future instructions before the results of current instructions are known, while providing hardware safeguards to prevent the errors that out-of-order execution might cause
- Branch outcomes must be "guessed" to decide which instructions to execute simultaneously
- If you guess wrong, you throw away that part of the result
- Data dependencies may prevent successive instructions from executing in parallel, even if there are no branches:

    1. e = a + b
    2. f = c + d
    3. g = e × f

  Statements 1 and 2 are independent and can run in parallel, but statement 3 must wait for both results.
Power Wall
- The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
- That heat must be dissipated from a 1.5 × 1.5 cm chip, which is about the limit of what can be cooled by air

Limitations in Processor Performance: Not Only Battery, but Also Heat!
- Memory Wall, ILP Wall, Power Wall
- Moore's Law: transistor density doubles every 18-24 months
- CMOS power: P_total = α·C·V²·f (active power) + V·I_leakage (standby power)
- A drastic increase in leakage current and a shrinking noise margin prevent supply-voltage scaling below about 1 V
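To make the two terms concrete, a small sketch in C that evaluates the formula (all component values are assumptions for illustration, not measurements of any real chip):

    #include <stdio.h>

    /* P_total = a*C*V^2*f + V*I_leak : dynamic (active) plus static (standby) power */
    static double total_power(double a, double c, double v, double f, double i_leak) {
        return a * c * v * v * f + v * i_leak;
    }

    int main(void) {
        /* assumed: activity factor 0.2, switched capacitance 1 nF,
           1.0 V supply, 2 GHz clock, 0.5 A leakage current */
        double p1 = total_power(0.2, 1e-9, 1.0, 2e9, 0.5);
        /* halving V and f cuts the dynamic term by 8x (the V^2 * f effect) */
        double p2 = total_power(0.2, 1e-9, 0.5, 1e9, 0.5);
        printf("P = %.2f W vs. %.2f W\n", p1, p2);   /* 0.90 W vs. 0.30 W */
        return 0;
    }

Note how the standby term barely moves: this is why leakage, not switching, blocks further voltage scaling.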
Power Wall
- Power dissipation in clocked digital devices is proportional to the clock frequency, imposing a natural limit on clock rates
- Significantly increasing clock speed without heroic (and expensive) cooling is not possible → chips would simply melt
- Clock speed increased by a factor of 1,000 during the last two decades
  - The ability of manufacturers to dissipate heat is limited, though
  - Looking back at the last five years, clock rates have been pretty much flat
- You could bank on materials science (MS) breakthroughs
  - MS engineers have usually delivered; can they keep doing it?
Pollack’s Rule: Trade-offs
[Figure: across CMOS process nodes from 1.5 µm down to 0.18 µm, each lead microarchitecture's area and performance improvement relative to a compaction of the previous one in the same technology.]

Pollack's Rule: "performance increase due to microarchitecture advances is roughly proportional to the square root of the increase in complexity."

Implications (in the same technology):
- A new microarchitecture consumes about 2-3x the die area of the previous one, but provides only 1.5-1.7x the performance
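As a rough arithmetic check (not from the slide): perf_new / perf_old ≈ sqrt(area_new / area_old), and sqrt(2) ≈ 1.41 while sqrt(3) ≈ 1.73, which matches the quoted 1.5-1.7x performance for a 2-3x area increase.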
Multi-core
- Put multiple CPUs on the same die
- Why is this better than multiple dies?
  - Smaller, cheaper
  - Closer together, so lower inter-processor latency
  - Can share an L2 cache (complicated)
  - Less power
- Costs of multi-core:
  - Complexity
  - Slower single-thread execution
Creating Parallel Processing Programs
- It is difficult to write SW that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases
- The first reason is that the parallel program must deliver better performance and efficiency than a sequential program on a uniprocessor; otherwise the extra effort is not worthwhile
- Consider the analogy of eight reporters trying to write a single story together in hopes of doing the work eight times faster
But … (Fortunately)
- With the rise of the Internet and rich multimedia applications, the need to handle independent tasks and huge data sets increased dramatically → task-level parallelism and data-level parallelism
- The user computing environment is changing to include many "background" tasks
- Multiprocessors can speed up these types of applications with the help of tighter integration of cores and multithreading
Multi-core vs. Manycore
- Multi-core: the current trajectory
  - Stay with the current fastest core design
  - Replicate every 18 months (2, 4, 8, ...)
  - Advantage: does not alienate serial workloads
  - Examples: AMD X2 (2 cores), Intel Core 2 Quad (4 cores), AMD Barcelona (4 cores)
- Manycore: converging in this direction
  - Simplify the cores (shorter pipelines, lower clock frequencies, in-order processing)
  - Start at 100s of cores and replicate every 18 months
  - Advantages: easier verification, defect tolerance, highest compute per surface area, best power efficiency
  - Examples: Cell SPE (8 cores), NVIDIA G80 (128 cores), Intel Polaris (80 cores), Cisco/Tensilica Metro (188 cores)
- Convergence: ultimately toward manycore
  - Manycore, if we can figure out how to program it!
  - Hedge: heterogeneous multi-core
Manycore System: CPU or GPU
- CPU: a large cache and sophisticated flow control minimize latency for arbitrary memory accesses in a serial process
- GPU: simple flow control and limited caches leave more transistors for parallel computation; high arithmetic intensity hides memory latency

[Figure: CPU die dominated by control logic and cache with a few ALUs, vs. GPU die dominated by many ALUs; each attaches to DRAM. Source: NVIDIA]
How Small is “Small”
- Power5 (server): 389 mm², 120 W @ 1900 MHz
- Intel Core 2 sc (laptop): 130 mm², 15 W @ 1000 MHz
- ARM Cortex-A8 (automobiles): 5 mm², 0.8 W @ 800 MHz
- Tensilica DP (cell phones / printers): 0.8 mm², 0.09 W @ 600 MHz
- Tensilica Xtensa (Cisco router): 0.32 mm² for 3 cores!, 0.05 W @ 600 MHz

[Figure: relative die sizes of Power5, Intel Core 2, ARM, Tensilica DP, and 3 Xtensa cores.]

Each small core operates at 1/3 to 1/10 the efficiency of the largest chip, but you can pack 100× more cores onto a chip and consume 1/20 the power.
More Concurrency: Design for Low Power
- Cubic power improvement with lower clock rate due to the V²f term: halving both voltage and frequency cuts dynamic power to roughly one-eighth
- Slower clock rates enable the use of simpler cores
- Simpler cores use less area (lower leakage) and reduce cost
- Tailor the design to the application to reduce waste

[Figure: the same die-size comparison (Power5, Intel Core 2, ARM, Tensilica DP, Xtensa ×3) as on the previous slide.]

This is how iPhones and MP3 players are designed to maximize battery life and minimize cost.
Tension between Concurrency and Power Efficiency
- Highly concurrent systems can be more power efficient
  - Dynamic power is proportional to V²fC
  - So build systems with even higher concurrency?
- However, many algorithms are as yet unable to exploit massive concurrency
- If higher concurrency cannot deliver a faster time to solution, the power-efficiency benefit is wasted
- So should we build fewer, faster processors instead?
Path to Power Efficiency: Reducing Waste in Computing
- Examine the methodology of the low-power embedded computing market, which is optimized for low power, low cost, and high computational efficiency

"Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste."
- Mark Horowitz, Stanford University & Rambus Inc.

- Sources of waste:
  - Wasted transistors (surface area)
  - Wasted computation (useless work, speculation, stalls)
  - Wasted bandwidth (data movement)
  - Designing for serial performance
What’s Next?
Source: Jack Dongarra, Intl. Supercomputing Conf. (ISC) 2008
[Figure: projected chip classes for different markets: all large cores (business), mixed large and small cores (home), all small cores (games/graphics), and many floating-point cores (scientific), plus 3D-stacked memory.]
The question is not whether this will happen but whether we are ready
Intel Single-chip Cloud Computer (Dec. 2009)
Parallelism: Introduction
Little’s Law
- Throughput (T) = number in flight (N) / latency (L)
- Example: with 4 floating-point registers and 8 cycles per floating-point op, Little's Law gives T = 4/8 = 1/2 issue per cycle

[Figure: operations flowing through Issue, Execution, and WB stages.]
Basic Performance Quantities
- Latency: the time every operation requires to execute (instruction, memory, or network latency)
- Bandwidth: the number of (parallel) operations completed per cycle (number of FPUs, DRAM channels, network links, etc.)
- Concurrency: the total number of operations in flight
- Little's Law relates the three: Concurrency = Latency × Bandwidth, or equivalently, Effective Throughput = Expressed Concurrency / Latency
- The available concurrency must be filled with parallel operations
- You cannot exceed peak throughput with superfluous concurrency
- Each channel has a maximum (limited) throughput
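A small sketch applying Concurrency = Latency × Bandwidth to a memory channel (the 100 ns latency and 10 GB/s bandwidth are assumed values for illustration):

    #include <stdio.h>

    int main(void) {
        double latency_s     = 100e-9;  /* assumed 100 ns memory latency */
        double bandwidth_bps = 10e9;    /* assumed 10 GB/s peak bandwidth */
        /* Little's Law: bytes that must be in flight to sustain peak bandwidth */
        double in_flight = latency_s * bandwidth_bps;
        printf("required concurrency: %.0f bytes in flight (~%.0f cache lines of 64 B)\n",
               in_flight, in_flight / 64.0);
        return 0;
    }

With these numbers, about 1,000 bytes (roughly 16 cache lines) must be outstanding at all times, which is why latency hiding requires many parallel memory operations.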
Performance Optimization: Contending Forces
- Contending forces: device efficiency vs. usage/traffic
- Often boils down to several key challenges:
  - Management of data/task locality
  - Management of data dependences
  - Management of communication
  - Management of variable and dynamic parallelism

[Figure: implementation and algorithmic optimization balances three goals: improve throughput, reduce the volume of data, and restructure to satisfy Little's Law.]
Classes of Parallelism and Parallel Architectures (1/2)
- There are basically two kinds of parallelism in applications:
  - Data-level parallelism (DLP): many data items can be operated on at the same time
  - Task-level parallelism (TLP): tasks of work are created that can operate independently and largely in parallel
Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2011
Classes of Parallelism and Parallel Architectures (2/2)
- Computer HW in turn can exploit these two kinds of application parallelism in four major ways:
  - Instruction-level parallelism: exploits DLP at modest levels with compiler help, using ideas like pipelining, and at medium levels using ideas like speculative execution
  - Vector architectures and GPUs: exploit DLP by applying a single instruction to a collection of data in parallel (SIMD)
  - Thread-level parallelism: exploits either DLP or TLP in a tightly coupled hardware model that allows interaction among parallel threads
  - Request-level parallelism: exploits parallelism among largely decoupled tasks specified by the programmer or the OS
Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2011
Uses of Parallelism
- "Horizontal" parallelism for throughput
  - More units working in parallel
- "Vertical" parallelism for latency hiding
  - Pipelining: keep units busy while waiting on resource, data, and control dependencies

[Figure: four units A-D operating side by side (throughput) vs. stages A-D of successive operations overlapped in time (latency hiding).]
Program Execution Time
- Latency metric: program execution time in seconds
- Your system architecture can affect all of its factors:
  - CPI (cycles per instruction): memory latency, IO latency, ...
  - CCT (clock cycle time, i.e., clock frequency): cache organization, power budget, ...
  - IC (instruction count): OS overhead, compiler choice, ...

    CPU time = Seconds / Program
             = (Cycles / Program) × (Seconds / Cycle)
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
             = IC × CPI × CCT

Are IC, CPI, and CCT independent? In practice they interact.
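A worked example with illustrative numbers: a program that executes IC = 2×10^9 instructions at CPI = 1.2 on a 2 GHz clock (CCT = 0.5 ns) takes 2×10^9 × 1.2 × 0.5 ns = 1.2 s.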
Architecture Methods for Performance Enhancement
- Powerful instructions
  - MD-technique: multiple data operands per operation: SIMD (vector, sub-word SIMD extensions)
  - MO-technique: multiple operations per instruction: sophisticated ISAs (e.g., CISC-like), VLIW
- Pipelining
- Multiple instruction issue
  - Single stream: superscalar
  - Multiple streams: multithreading, multi-core
Powerful Instructions – MD Technique
- MD-technique
  - Multiple data operands per operation
  - SIMD: Single Instruction, Multiple Data

Vector instruction example:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5 * b[i];

or simply c = a + 5*b. Assembly:

    Set   vl, 64       ; vector length = 64
    Ldv   v1, 0(r2)    ; load vector b
    Mulvi v2, v1, 5    ; v2 = 5 * b
    Ldv   v1, 0(r1)    ; load vector a
    Addv  v3, v1, v2   ; v3 = a + 5*b
    Stv   v3, 0(r3)    ; store vector c
Powerful Instructions – MD Technique
- SIMD computing
  - All PEs (processing elements) execute the same operation
  - Typical mesh or hypercube connectivity
  - Exploits the data locality of, e.g., image-processing applications
  - Dense encoding (few instruction bits needed)

[Figure: SIMD execution method: instructions 1 .. n are broadcast over time to PE1 .. PEn, which execute each instruction in lockstep.]
Powerful Instructions – MD Technique
- Sub-word parallelism
  - SIMD on a restricted scale for multimedia instructions
  - Short vectors added to existing microprocessor ISAs
  - Examples: Intel MMX/SSE/AVX, ARM NEON, AMD 3DNow!

[Figure: four sub-word multiplications performed in parallel within one wide register.]
Powerful Instructions – MO Technique
- MO-technique: multiple operations per instruction
- Two options:
  - CISC (Complex Instruction Set Computer)
  - VLIW (Very Long Instruction Word)

VLIW instruction example (one field per functional unit):

    FU1             FU2             FU3             FU4            FU5
    sub r8, r5, 3   and r1, r5, 12  mul r6, r5, r2  ld r3, 0(r5)   bnez r5, 13
Parallelism: Data-Level Parallelism
Recall: Flynn's Classification of Processor Architecture
- SISD: Single Instruction, Single Data stream
  - Uniprocessor
- SIMD: Single Instruction, Multiple Data streams
  - The same instruction is executed by multiple processing units
  - e.g., multimedia processors, vector architectures
- MISD: Multiple Instruction, Single Data stream
  - Successive functional units operate on the same stream of data
  - Rarely found in general-purpose commercial designs
- MIMD: Multiple Instruction, Multiple Data streams
  - Each processor has its own instruction and data streams
  - The most popular form of parallel processing
  - Single-user: high performance for one application
  - Multiprogrammed: running many tasks simultaneously (e.g., servers)
[Figure: SISD: a single PU fed by one instruction pool and one data pool. SIMD: one instruction pool driving multiple PUs, each on its own data stream.]
Data-level Parallelism
- Data parallelism focuses on distributing data across parallel computing nodes; it is usually found in:
  - Multimedia computing: identical operations on streams or arrays of sound samples, pixels, and video frames
  - Scientific computing: weather forecasting, car-crash simulation, biological modeling
DLP Kernels Dominate Many Computational Workloads

DLP and Throughput Computing
Source: Chuck Moore (AMD, 2011)
Data Parallelism & Loop Level Parallelism (LLP)
- Data parallelism: similar independent/parallel computations on different elements of arrays, which usually result in independent (parallel) loop iterations
- A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop: loop-level parallelism (LLP)
- Unrolling the loop, either statically by the compiler or dynamically by hardware, increases the size of the basic block
- The resulting larger basic block provides more instructions that the compiler/hardware can schedule or reorder to eliminate more stall cycles

Example (an unrolled version appears below):

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

Four vector instructions:

    LV   ; load vector X
    LV   ; load vector Y
    ADDV ; add vectors: X = X + Y
    SV   ; store vector X
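A sketch of static unrolling by a factor of 4 (the factor is illustrative; 1000 is divisible by 4, so no cleanup loop is needed):

    for (i = 1; i <= 1000; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }

The four statements in the body are mutually independent, so the compiler or hardware can schedule them in parallel instead of stalling once per iteration.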
Resurgence of DLP
- The convergence of application demands and technology constraints drives the architecture choice
- New applications such as graphics, machine vision, speech recognition, and machine learning all require large numerical computations that are often trivially data parallel
- SIMD-based architectures (vector SIMD, sub-word SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms
SIMD Classifications
- Vector architectures
- SIMD extensions (sub-word SIMD)
  - e.g., Intel MMX: Multimedia Extensions (1996), SSE: Streaming SIMD Extensions (1999), AVX: Advanced Vector Extensions (2010)
- Graphics Processing Units (GPUs)
Vector Architectures
- Basic idea:
  - Read sets of data elements into "vector registers"
  - Operate on those registers
  - Disperse the results back into memory
- Registers are controlled by the compiler
  - Register files act as compiler-controlled buffers
  - Used to hide memory latency and to leverage memory bandwidth
- Vector loads/stores are deeply pipelined
  - Pay the memory latency once per vector load/store!
- Regular loads/stores
  - Pay the memory latency for each element
[Figure: scalar vs. vector execution. Scalar (one operation): add r3, r1, r2 combines registers r1 and r2 into r3. Vector (N operations): vadd.vv v3, v1, v2 adds corresponding elements of vector registers v1 and v2 into v3 across the whole vector length.]
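A plain-C sketch of the strip-mining a compiler performs to fit an arbitrary-length loop to the hardware's maximum vector length (MVL = 64 is an assumed value; the inner loop stands in for one vector instruction executed with the vector length register set to vl):

    #include <stddef.h>

    #define MVL 64  /* assumed maximum hardware vector length */

    /* c[i] = a[i] + b[i] for arbitrary n, strip-mined into chunks of <= MVL */
    void vadd_stripmined(const double *a, const double *b, double *c, size_t n) {
        for (size_t i = 0; i < n; ) {
            size_t vl = (n - i < MVL) ? (n - i) : MVL;  /* set the VLR */
            for (size_t j = 0; j < vl; j++)             /* one "vector op" */
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }

Only the first (possibly short) chunk needs a vector length below MVL; every other chunk runs at the full hardware width.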
Vector Programming Model
[Figure: vector programming model. Scalar registers r0-r15 alongside vector registers v0-v15, each holding elements [0] .. [VLRMAX-1], with a vector length register (VLR). A vector arithmetic instruction, ADDV v3, v1, v2, adds elements [0] .. [VLR-1] of v1 and v2 lane by lane into v3. A vector load/store, LV v1, r1, r2, moves a vector between memory and register v1 using base address r1 and stride r2.]
Multiple Datapaths
- Vector elements are interleaved across lanes
  - Example: V[0, 4, 8, ...] on lane 1, V[1, 5, 9, ...] on lane 2, etc.
- Multiple elements are computed per cycle
  - Example: lane 1 computes on V[0] and V[4] in one cycle
- Modular, scalable design
  - No inter-lane communication is needed for most vector instructions
Vector Processors (I)
- A vector is a one-dimensional array of numbers
- Many scientific/commercial programs use vectors:

    for (i = 0; i <= 49; i++)
        C[i] = (A[i] + B[i]) / 2;

- A vector processor is one whose instructions operate on vectors rather than scalar (single-data) values
- Basic requirements:
  - Load/store whole vectors → vector registers (contain vectors)
  - Operate on vectors of different lengths → vector length register (VLEN)
  - Elements of a vector may be stored apart from each other in memory → vector stride register (VSTR); the stride is the distance between two elements of a vector
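For instance, walking a column of a row-major matrix is a strided access; the stride the VSTR would hold is the row width (a small illustrative sketch, not tied to any particular vector ISA):

    /* Row-major N x M matrix: element (i, j) lives at m[i*M + j].
       Reading one column j touches memory with a stride of M elements,
       which is exactly what the vector stride register expresses. */
    double column_sum(const double *m, int N, int M, int j) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += m[i * M + j];   /* consecutive accesses are M elements apart */
        return s;
    }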
Vector Processors (II)
- A vector instruction performs an operation on each element in consecutive cycles
  - Vector functional units are pipelined
  - Each pipeline stage operates on a different data element
- Vector instructions allow deeper pipelines
  - No intra-vector dependencies → no hardware interlocking needed within a vector
  - No control flow within a vector
  - A known stride allows vectors to be prefetched into cache/memory
Vector Processor Pros
- No dependencies within a vector
  - Pipelining and parallelization work well
  - Pipelines can be very deep: there are no dependencies!
- Each instruction generates a lot of work
  - Reduces instruction fetch bandwidth
- Highly regular memory access pattern
  - Interleaving across multiple banks gives higher memory bandwidth
  - Prefetching is straightforward
- No need to explicitly code loops
  - Fewer branches in the instruction sequence
Vector Processor Cons
- Still requires a traditional scalar unit (integer and FP) for the non-vector operations
- Difficult to maintain precise interrupts (cannot roll back all the individual operations already completed)
- The compiler or programmer has to vectorize programs
- Not very efficient for small vector sizes
- Not suitable/efficient for many different classes of applications
- Requires a specialized, high-bandwidth memory system
  - Usually built around heavily banked memory with data interleaving
Vector Processor Limitations
- The performance of a vector instruction depends on the length of the operand vectors
- Initiation rate:
  - The rate at which individual operations can start in a functional unit
  - For fully pipelined units this is one operation per cycle
- Start-up time (latency):
  - The time it takes to produce the first element of the result
  - Depends on how deep the functional-unit pipelines are; especially large for the load/store unit
Multimedia SIMD Extensions
- Key ideas:
  - Media applications operate on data types narrower than the native word size
    - Video and graphics systems use 8 bits per primary color
    - Audio samples use 8-16 bits
  - No memories associated with the ALUs; instead, a pool of relatively wide (64- to 256-bit) registers stores several narrower operands
    - e.g., a 256-bit adder: 16 simultaneous operations on 16-bit data, or 32 simultaneous operations on 8-bit data
  - No direct communication between ALUs; data moves via registers and special shuffle/permutation instructions
  - Not co-processors or supercomputers, but tightly integrated into the CPU pipeline
Multimedia SIMD Extensions
- Meant for programmers to utilize, not for compilers to generate
  - Recent x86 compilers are capable for FP-intensive apps, though
- Why is it popular?
  - Costs little to add to the standard arithmetic unit
  - Easy to implement
  - Needs smaller memory bandwidth than a vector architecture
  - Separate data transfers are aligned in memory
    - Vector: a single instruction may make 64 memory accesses, and a page fault in the middle of the vector is likely!
  - Uses much smaller register space, with fewer operands
  - No need for the sophisticated mechanisms of vector architectures
Multimedia Extensions (aka SIMD extensions)
- Very short vectors added to existing microprocessor ISAs
- Use an existing wide register split into smaller sub-word registers
  - The Lincoln Labs TX-2 from 1957 had a 36-bit datapath split into 2 × 18 bits or 4 × 9 bits
  - Newer designs have wider registers:
    - 128 bits for PowerPC AltiVec and Intel SSE2/3/4
    - 256 bits for Intel AVX
- A single instruction operates on all elements within the register (see the sketch below)

[Figure: a 64-bit register viewed as 2 × 32-bit, 4 × 16-bit, or 8 × 8-bit sub-words; four 16-bit adds execute in parallel.]
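As a concrete taste of sub-word SIMD, a minimal sketch using Intel SSE2 intrinsics, which perform eight 16-bit additions with one instruction (compile with SSE2 enabled, e.g., gcc -msse2):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* eight 16-bit lanes per 128-bit XMM register */
        short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        short c[8];

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_add_epi16(va, vb);   /* 8 parallel 16-bit adds */
        _mm_storeu_si128((__m128i *)c, vc);

        for (int i = 0; i < 8; i++)
            printf("%d ", c[i]);              /* 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }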
SIMD Multimedia Extensions like SSE-4
- At the core of multimedia extensions:
  - SIMD parallelism
  - Variable-sized data fields: vector length = register width / type size

[Figure: wide registers V0-V31 partitioned as sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands across one wide unit.]
Multimedia Extensions versus Vectors
- Limited instruction set:
  - No vector length control
  - No strided load/store or scatter/gather
  - Unit-stride loads must be aligned to a 64/128-bit boundary
- Limited vector register length:
  - Requires superscalar dispatch to keep the multiply/add/load units busy
  - Loop unrolling to hide latencies increases register pressure
- Trend toward fuller vector support in microprocessors:
  - Better support for misaligned memory accesses
  - Support for double precision (64-bit floating point)
  - The Intel AVX spec (announced April 2008): 256-bit vector registers, expandable up to 1024 bits
Parallelism: Instruction-Level Parallelism
ILP?
- Instruction-level parallelism (ILP) is a measure of how many operations in a computer program can be performed simultaneously
- The potential overlap among instructions is called instruction-level parallelism
- There are two approaches to exploiting ILP:
  - Dynamic: mainly hardware locates the parallelism → superscalar
  - Static: largely relies on software to locate the parallelism → VLIW (Very Long Instruction Word)
- How much ILP exists in programs is very application-specific
  - In certain fields, such as graphics and scientific computing, the amount can be very large
  - Workloads such as cryptography exhibit much less parallelism
ILP vs. PLP
- ILP (instruction-level parallelism): overlap individual machine operations (add, mul, load, ...) so that they execute in parallel
- PLP (processor-level parallelism): separate processors work on separate chunks of the program (the processors are programmed to do so)
Micro-architectural Techniques for ILP
- Instruction pipelining
- Superscalar or VLIW
  - Multiple execution units are used to execute multiple instructions in parallel
- Out-of-order execution
  - This technique is independent of both pipelining and superscalar issue
  - Register renaming is used to enable out-of-order execution
- Speculative execution
  - Execution of complete instructions, or parts of instructions, before it is certain that this execution should take place
  - Branch prediction is used together with speculative execution
Micro-architectural Techniques for ILP
- Modern processor techniques:
  - Deep pipelines
  - Superscalar issue
  - Out-of-order, speculative execution
  - Branch prediction
  - Register renaming, dataflow order
- Execution flow:
  - In-order, speculative fetch
  - Out-of-order execute
  - In-order commit, using a reorder buffer for precise exceptions

[Figure: pipeline front end (fetch unit with branch prediction and I-cache, instruction fetch buffer, decode/rename, dispatch) operating in order; reservation stations issuing to Int, Int, Float, Float, L/S, and L/S units with the D-cache out of order; reorder buffer and write buffer retiring in order.]
ILP (Parallel Instruction Execution) Constraints
- Structural dependences (resource contention)
- Code dependences (sequential semantics of the program):
  - Control dependences
  - Data dependences:
    - (RAW) true dependences
    - Storage conflicts (not present in in-order processors):
      - (WAR) anti-dependences
      - (WAW) output dependences
Types of Dependencies
- Structural dependence (structural hazard): the HW perspective
- Code dependence: the SW (program) perspective
  - Data dependence (data hazard):
    - True data dependence
    - Name dependences: output dependence, anti-dependence
  - Control dependence (control hazard)

Note: "hazard" is the hardware terminology; "dependence" is the software terminology.
Visualizing Pipelining
[Figure: four instructions, in order, each passing through Ifetch, Reg, ALU, DMem, Reg, overlapped across clock cycles 1-7.]
Pipelining
- Overlaps the execution of instructions by exploiting instruction-level parallelism
- Recall that CPU time (latency) = Seconds/Program = (Cycles/Program) × (Seconds/Cycle) = IC × CPI × CCT
- Pipelining became a universal technique in 1985
- Performance enhancement options:
  - Reduce the number of instructions per program (IC): given the ISA, this depends entirely on SW (compiler, programmer)
  - Reduce the number of cycles per instruction (CPI)
  - Reduce the number of seconds per cycle (CCT): CPI and CCT mostly depend on the HW organization and implementation technology under the system requirements
- Pipelining can reduce CCT and (effective) CPI
Pipelining is not quite that easy!
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: the HW cannot support this combination of instructions (a single person trying to fold and put away clothes at the same time)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (the missing sock)
  - Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)

Note: "hazard" is the hardware terminology; "dependence" is the software terminology.
Structural Hazards
- Occur when two or more different instructions want to use the same hardware resource in the same cycle
- e.g., MEM uses the same memory port as IF, as shown in the figure

[Figure: Load and Instr 1-4 flowing through the five-stage pipeline over cycles 1-7; the Load's DMem access lands in the same cycle as a later instruction's Ifetch, so both contend for the single memory port.]
Structural Hazards
- Structural hazards are reduced with these rules:
  - Each instruction uses a resource at most once
  - Always use the resource in the same pipeline stage
  - Use the resource for one cycle only
- ISAs are designed with this in mind
  - Sometimes it is very complex to achieve
  - Occurrence depends heavily on the program and the hardware resources
- Some common structural hazards:
  - Memory access conflicts
  - Floating point: since many floating-point instructions require many cycles, it is easy for them to interfere with each other
  - Starting up more instructions of one type than there are resources for
Data Hazards
- The use of the result of the ADD instruction in the next three instructions causes a hazard, since register r1 is not written until after those instructions read it:

    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

[Figure: the five instructions in a five-stage pipeline (IF, ID/RF, EX, MEM, WB); r1 is written back by the ADD only in its WB stage, after the following instructions have already read their operands.]
Data Hazards
- Read After Write (RAW)
  - Caused by a true dependence: a need for communication
  - Instr J tries to read an operand before Instr I writes it:

      I: add r1, r2, r3
      J: sub r4, r1, r3

- Write After Read (WAR)
  - Caused by an anti-dependence and the re-use of the name "r1"
  - Instr J tries to write operand r1 before Instr I reads it:

      I: add r4, r1, r3
      J: add r1, r2, r3
      K: mul r6, r1, r7

- Write After Write (WAW)
  - Caused by an output dependence and the re-use of the name "r1"
  - Instr J tries to write operand r1 before Instr I writes it:

      I: sub r1, r4, r3
      J: add r1, r2, r3
      K: mul r6, r1, r7

- WAR and WAW happen in concurrent execution or out-of-order (OoO) pipelines

Solutions for data hazards:
- Stalling
- Forwarding: connect the new value directly to the next stage
- Speculation (with HW) or reordering (with the compiler and/or HW)
Control Hazards
- A control hazard occurs when we need to find the destination of a branch and cannot fetch any new instructions until we know that destination:

    10: beq r1, r3, 36
    14: and r2, r3, r5
    18: or  r6, r1, r7
    22: add r8, r1, r9
    ...
    36: xor r10, r1, r11

[Figure: the branch and the following instructions in the five-stage pipeline; the instructions after the beq are fetched before the branch outcome is known.]
Five Branch Hazard Alternatives
#1: Stall until the branch direction is clear.

#2: Predict branch not taken
- Execute successor instructions in sequence; "squash" instructions in the pipeline if the branch is actually taken
- Advantage of late pipeline state update
- 47% of MIPS branches are not taken on average
- PC+4 is already calculated, so use it to fetch the next instruction

#3: Predict branch taken
- 53% of MIPS branches are taken on average
- But the branch target address has not yet been calculated in MIPS, so MIPS still incurs a 1-cycle branch penalty
- Other machines: branch target known before the outcome

#4: Execute both paths.

#5: Delayed branch
- Define the branch to take place AFTER a following instruction:

    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target (if taken)

- A 1-slot delay allows a proper decision and branch-target-address calculation in a 5-stage pipeline
Pipelining
- Pipelined design: one stage per cycle, overlapping instructions
- Cost: pipeline registers
- To reduce stalls:
  - Forwarding paths for data dependences
  - Predict-not-taken branches for control dependences
  - Instruction and data caches to reduce memory stalls

[Figure: five-stage datapath: PC and I-cache (instruction fetch); decoder and register file (decode and read operands); ALU (execute); D-cache (memory access); write-back.]
Pipelining and ILP
- Higher clock frequency (lower CCT): deeper pipelines
  - Decompose pipeline stages into smaller stages to overlap more instructions
- Lower CPI_base: wider pipelines
  - Insert multiple instructions into the pipeline in parallel
- Lower CPI_stall:
  - Diversified pipelines for different functional units
  - Out-of-order execution
- Balance the conflicting goals:
  - Deeper and wider pipelines → more control hazards → branch prediction (speculation)
Deep Pipelining
- Idea: break the instruction up into N stages
  - Ideal CCT = 1/N of the non-pipelined CCT, so let's use a large N
- Other motivations for deep pipelines:
  - Not all basic operations have the same latency (integer ALU, FP ALU, cache access)
  - It is difficult to fit them all in one pipeline stage: CCT must be large enough to fit the longest one
  - So break some of them into multiple pipeline stages, e.g., data cache access in 2 stages, FP add in 2 stages, FP multiply in 3 stages

[Figure: an 8-stage pipeline: Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers.]
Limits to Pipeline Depth
- Each pipeline stage introduces some overhead O:
  - Delay of the pipeline registers
  - Inequalities in work per stage: work cannot be split into stages at arbitrary points
  - Clock skew: clocks to different registers may not be perfectly aligned
- If the original CCT was T, with N stages the CCT is T/N + O
- As N → ∞, speedup = T / (T/N + O) → T/O, assuming IC and CPI stay constant
- Eventually the overhead dominates, leading to diminishing returns
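A small sketch of the diminishing returns implied by speedup = T / (T/N + O); the values T = 10 ns of logic and O = 0.2 ns of per-stage overhead are assumptions for illustration:

    #include <stdio.h>

    int main(void) {
        double T = 10.0, O = 0.2;   /* assumed: ns of logic, ns overhead per stage */
        int depths[] = {1, 2, 5, 10, 20, 50};
        for (int i = 0; i < 6; i++) {
            int N = depths[i];
            /* pipelined cycle time, and speedup over the unpipelined design */
            double cct = T / N + O;
            printf("N = %2d  CCT = %5.2f ns  speedup = %5.2f (limit T/O = %.0f)\n",
                   N, cct, T / cct, T / O);
        }
        return 0;
    }

Going from 10 to 50 stages improves the speedup far less than going from 1 to 10, exactly because the fixed overhead O comes to dominate the cycle time.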
Pipelining Limits
- High clock frequency, but modest performance gains, due to memory latency and branch delays
- Power consumption increases dangerously!

[Figure: frequency and power trend from Pentium 3 to Pentium 4. Source: Grochowski, Intel, 1997]
Wide or Superscalar Pipelines
- Idea: operate on N instructions each cycle
  - Parallelism at the instruction level: CPI_base = 1/N
- Options (from simpler to harder):
  - One integer and one floating-point instruction
  - Any N = 2 instructions
  - Any N = 4 instructions
  - Any N = ? instructions
- What are the limits here?

[Figure: a pipeline (Fetch, Decode/Read Registers, ALU, Memory, Write Registers) processing multiple instructions per stage.]
Diversified Pipelines
- Idea: decouple the execution portion of the pipeline for different instruction types
  - Separate pipelines for simple integer, integer multiply, FP, and load/store
- Advantage: avoids unnecessary stalls
  - e.g., a slow FP instruction does not block independent integer instructions
- Disadvantages:
  - WAW hazards
  - Imprecise (out-of-order) exceptions

[Figure: a common front end (Fetch, Decode/Read Registers) fans out to execution pipelines of different depths (single-stage integer add, multi-stage integer multiply, four-stage FPU, multi-stage memory) that merge again at Write Registers.]
ILP Architectures
- A computer architecture is a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs written for the architecture and the set of processor implementations of that architecture
- An ILP architecture adds to this contract information embedded in the program about the available parallelism between the instructions and operations in the program
Sequential Architecture and Superscalar Processors
- The program contains no explicit information about the dependences that exist between instructions
- Dependences between instructions must be determined by the hardware
  - It is only necessary to check dependences against sequentially preceding instructions that have been issued but not yet completed
- The compiler may reorder instructions to facilitate the hardware's task of extracting parallelism
Scalar, Superscalar, Deep pipeline
- Scalar processor: one instruction passes through in each cycle
- Superscalar processor: more than one instruction passes through in each cycle
  - For an m-way superscalar, the effective CPI is 1/m that of the scalar pipeline

[Figure: a 3-way pipelined superscalar.]
Superscalar Performance
- Performance spectrum?
  - If all instructions were dependent: no speedup; superscalar buys us nothing
  - If all instructions were independent: speedup = N, where N is the degree of superscalarity
- Again, the key is typical program behavior: some parallelism exists
Simplified View of an OoO Superscalar Processor
[Figure: simplified OoO superscalar pipeline. In-order front end: fetch unit (with branch prediction and I-cache), instruction fetch buffer, then decode/rename, which reads registers or assigns register tags and advances instructions through dispatch to the reservation stations, bounded by the issue width. Out-of-order execution: reservation stations monitor register tags, receive forwarded data, and issue an instruction once all of its operands are ready, to Int, Float, and L/S units backed by the D-cache. In-order commit: the reorder buffer retires instructions in program order through the write buffer.]
Independence Architecture and VLIW Processors
- By knowing which operations are independent, the hardware needs no further checking to determine which instructions can issue in the same cycle
- The set of independent operations is much larger than the set of dependent operations
  - Only a subset of the independent operations is specified
- The compiler may additionally specify on which functional unit and in which cycle an operation executes
- The hardware needs to make no run-time decisions
VLIW Processors
- Operation vs. instruction:
  - Operation: a unit of computation (add, load, branch; what a sequential architecture calls an instruction)
  - Instruction: a set of operations intended to issue simultaneously
- The compiler decides which operations go into each instruction (scheduling)
- All operations that are supposed to begin at the same time are packaged into a single VLIW instruction

[Figure: two VLIW instructions, each fanning three operations through IF, ID, EX, M, WB in lockstep.]
VLIW: Very Long Instruction Word
- The compiler schedules parallel execution
- Multiple parallel operations are packed into one long instruction word
- The compiler must avoid data hazards (there are no interlocks)

Example instruction format:

    | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |

with two single-cycle-latency integer units, two three-cycle-latency load/store units, and two four-cycle-latency floating-point units.
VLIW Strengths
- The hardware is very simple: a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
- More silicon goes to actual processing (rather than being spent on branch prediction, for example)
- It should run fast, as the only limit is the latency of the functional units themselves
- Programming a VLIW chip is very much like writing microcode
VLIW Limitations
- The need for a powerful compiler
- Increased code size arising from aggressive scheduling policies
- Larger memory bandwidth and register-file bandwidth
- Limitations due to binary compatibility across implementations
VLIW past & future
- Decline of VLIWs for general-purpose systems:
  - They could not be integrated in a single chip
  - Binary compatibility between implementations
- Rediscovery of VLIW in embedded systems:
  - No more integrability issues
  - Binary incompatibility is not relevant (for a DSP, unlike a CPU)
  - Advantages of VLIW: simplified hardware; the architecture can be optimized ad hoc to achieve ILP
Summary: Superscalar vs. VLIW
- Additional info required in the program: superscalar: none; VLIW: minimally, a partial list of independences; at most, a complete specification of when and where each operation is to be executed
- Dependence analysis: superscalar: performed by HW; VLIW: performed by compiler
- Independence analysis: superscalar: performed by HW; VLIW: performed by compiler
- Scheduling: superscalar: performed by HW; VLIW: performed by compiler
- Role of compiler: superscalar: rearranges the code to make the HW's analysis and scheduling more successful; VLIW: replaces virtually all of the analysis and scheduling HW
ILP Open Problems
- Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions; two simple types of pipelining (structural and functional)
- Controller cost: most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling
- Area constraints: resource-constrained algorithms could have better interaction between scheduling and floorplanning
- Realism:
  - Scheduling realistic design descriptions that contain several special language constructs
  - Using more realistic libraries and cost functions
  - Expanding scheduling algorithms to incorporate different target architectures
Summary: Limits to ILP
- Doubling issue rates above today's 3-6 instructions per clock probably requires a processor to:
  - Issue 3-4 data-memory accesses per cycle
  - Resolve 2-3 branches per cycle
  - Rename and access over 20 registers per cycle
  - Fetch 12-24 instructions per cycle
- The complexity of implementing these capabilities is likely to mean sacrifices in maximum clock rate
  - The widest-issue processors tend to be the slowest in terms of clock rate
  - Also consider the ROI in terms of area and power
Summary: Limits to ILP (cont’d)
- Most ways to increase performance also boost power consumption
- The key question is energy efficiency: does a method increase power consumption faster than it boosts performance?
- Multiple-issue techniques are energy inefficient:
  - They incur logic overhead that grows faster than the issue rate
  - The number of switching transistors scales with the peak issue rate, while performance scales with the sustained rate; the growing gap between peak and sustained performance means increasing energy per unit of performance
Evolved Solution or Alternatives
- MT (multithreaded) approach: more tightly coupled than MP
  - Decentralized multithreaded architectures: hardware for inter-thread synchronization and communication; Multiscalar (U. of Wisconsin), Superthreading (U. of Minnesota)
  - Centralized multithreaded architectures: share pipelines among multiple threads; TERA, SMT (throughput-oriented); Trace Processor, DMT (performance-oriented)
- MP (multiprocessor) approach: decentralize all resources; multiprocessing on a single chip
  - Communicate through shared memory: Stanford Hydra
  - Communicate through messages: MIT RAW