ECE5917 SoC Architecture: MP SoC – Part 1
Tae Hee Han: [email protected]
Semiconductor Systems Engineering
Sungkyunkwan University
Outline
- Overview
- Parallelism
  - Data-Level Parallelism
  - Instruction-Level Parallelism
  - Thread-Level Parallelism
  - Processor-Level Parallelism
- Multi-core
Overview
Where Are We Headed?
[Figure: MIPS vs. year, 1970-2010, log scale. The single-chip CPU era (to ~2004) runs from the 8086/286/386/486 through pipelining, superscalar, and speculative out-of-order execution (the era of instruction-level parallelism), followed by SIMD extensions. From ~2004 onward: multithreading, multi-core, special-purpose HW, and CPU-GPU fusion (the era of thread- and processor-level parallelism).]
Note: the time frames reflect when each technique became popular, not when it first appeared.
Where Are We Headed? (Intel – AMD Architecture Transition)
[Figure: Intel and AMD product roadmaps, 2002-2012. Intel desktop/server: 130 nm Tualatin (P6/Pentium III), 130 nm Northwood/Gallatin and 90 nm Prescott/Smithfield (NetBurst), 65 nm Cedar Mill/Presler, 65 nm Merom and 45 nm Penryn (Core), 45 nm Nehalem, 32 nm Westmere, 32 nm Sandy Bridge, 22 nm Ivy Bridge. Intel mobile: 130 nm Banias and 90 nm Dothan (P6/Pentium M), 65 nm Yonah. AMD desktop/server: 180-130 nm K7, 130-65 nm K8, 65-45 nm K10 (K8L), 32 nm Bulldozer. Core counts grow from 1 to 2, 4 (including MCM parts), and 6 or more, marking the transition from the single-core era through the multi-core era to the system-level integration era; the "single-core crisis" and the "CELL shock" mark the turning points.]
Processor Architectures: Flynn's Classification
- SISD: Single Instruction, Single Data stream
  - Uniprocessor
- SIMD: Single Instruction, Multiple Data streams
  - The same instruction is executed by multiple processing units
  - e.g., multimedia processors, vector architectures
- MISD: Multiple Instruction, Single Data stream
  - Successive functional units operate on the same stream of data
  - Rarely found in general-purpose commercial designs
- MIMD: Multiple Instruction, Multiple Data streams
  - Each processor has its own instruction and data streams
  - The most popular form of parallel processing
  - Single-user: high performance for one application
  - Multiprogrammed: running many tasks simultaneously (e.g., servers)
[Figure: SISD: a single PU fed by one instruction pool and one data pool. SIMD: one instruction pool driving multiple PUs, each on its own data stream.]
System-level Integration (Chuck Moore, AMD at MICRO 2008)
- Single-chip CPU Era: 1986-2004
  - Extreme focus on single-threaded performance
  - Multi-issue, out-of-order execution plus a moderate cache hierarchy
- Chip Multiprocessor (CMP) Era: 2004-2010
  - Early: hasty integration of multiple cores into the same chip/package
  - Mid-life: addressed some of the HW scalability and (memory) interference issues
  - Current: homogeneous CPUs plus moderate system-level functionality
- System-level Integration Era: ~2010 onward
  - Integration of substantial system-level functionality
  - Heterogeneous processors and accelerators
  - Introspective control systems for managing on-chip resources and events
Challenges!: Chuck Moore (AMD, 2011)
[Figure: six trend sketches, each marked "we are here": Moore's Law (integration, log scale, vs. time; challenged by DFM, variability, reliability, and wire delay), the ILP complexity wall (IPC vs. issue width), the power wall (TDP budget vs. time; servers: power = $$, desktops: eliminate fans, mobile: battery), locality (performance vs. cache size), the frequency wall (frequency vs. time), and flattening single-thread performance.]
Three Walls to Serial Performance
- Memory Wall
- Instruction-Level Parallelism (ILP) Wall
- Power Wall
Source: the excellent article "The Many-Core Inflection Point for Mass Market Computer Systems" by John L. Manferdelli, Microsoft Corporation
http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/
Recall: Memory Wall
- The processor-memory (DRAM) performance gap!
  - DRAM: an access that took 1 cycle in 1980 takes hundreds of cycles in 2010
  - Registers: fast but small and expensive
  - We want memory that is fast, large, and cheap
Recall: Typical Memory Hierarchy
[Figure: memory hierarchy pyramid. From smaller/faster/costlier to larger/slower/cheaper: registers (flip-flops), L1 cache (SRAM), L2-L3 caches (SRAM), main memory (DRAM), flash cache/SSD, data storage (HDD), and external secondary storage (external HDD, tape, CD/DVD, cloud server). A performance gap of roughly 10^5 separates main memory from disk, bridged by the flash cache/SSD.]
Random-access (read) latency:

Type           Access time        Capacity                               Managed by
Register       1 cycle            ~500-1,000 B                           Compiler
L1 cache       ~3-4 cycles        ~64 KB                                 HW
L2 cache       ~10-30 cycles      ~256 KB                                HW
L3 cache       ~30-60 cycles      ~2-8 MB                                HW
Main memory    ~100-300 cycles    512 MB-4 GB (mobile) / 4-16 GB (PC)    OS
Flash storage  ~5K-10K cycles     8-32 GB (mobile) / 128-512 GB (PC)     OS/operator
HDD            ~10M-20M cycles    > 1 TB (PC)                            OS/operator
Recall: How to alleviate the Memory Wall Problem
- Hiding/reducing the memory access latency
  - A holistic approach: caches, local memory, DRAM stacking, HW/SW prefetching, data-locality optimization, memory controllers, SMT
- Increasing the bandwidth
  - Lower latency helps bandwidth, but not vice versa
- Reducing the number of memory accesses
  - Keep as much reusable data in caches and local memory as possible
ILP Wall
- Duplicate hardware speculatively executes future instructions before the results of current instructions are known, while providing hardware safeguards to prevent the errors that out-of-order execution might cause
- Branch outcomes must be "guessed" to decide which instructions to execute simultaneously
- If you guess wrong, you throw away that part of the result
- Data dependencies may prevent successive instructions from executing in parallel, even if there are no branches:

    1. e = a + b
    2. f = c + d
    3. g = e × f

  Statements 1 and 2 are independent and can run in parallel, but statement 3 must wait for both results.
Power Wall
- The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
- That heat must be dissipated from a 1.5 × 1.5 cm chip, which is about the limit of what can be cooled by air

Limitations in Processor Performance: Not Only Battery, but Also Heat!
- Memory Wall, ILP Wall, Power Wall
- Moore's Law: transistor density doubles every 18-24 months
- CMOS power: P_total = α·C·V²·f (active power) + V·I_leakage (standby power)
- A drastic increase in leakage current and a shrinking noise margin prevent supply-voltage scaling below about 1 V
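To make the two terms concrete, a small sketch in C that evaluates the formula (all component values are assumptions for illustration, not measurements of any real chip):

    #include <stdio.h>

    /* P_total = a*C*V^2*f + V*I_leak : dynamic (active) plus static (standby) power */
    static double total_power(double a, double c, double v, double f, double i_leak) {
        return a * c * v * v * f + v * i_leak;
    }

    int main(void) {
        /* assumed: activity factor 0.2, switched capacitance 1 nF,
           1.0 V supply, 2 GHz clock, 0.5 A leakage current */
        double p1 = total_power(0.2, 1e-9, 1.0, 2e9, 0.5);
        /* halving V and f cuts the dynamic term by 8x (the V^2 * f effect) */
        double p2 = total_power(0.2, 1e-9, 0.5, 1e9, 0.5);
        printf("P = %.2f W vs. %.2f W\n", p1, p2);   /* 0.90 W vs. 0.30 W */
        return 0;
    }

Note how the standby term barely moves: this is why leakage, not switching, blocks further voltage scaling.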
Power Wall
- Power dissipation in clocked digital devices is proportional to the clock frequency, imposing a natural limit on clock rates
- Significantly increasing clock speed without heroic (and expensive) cooling is not possible → chips would simply melt
- Clock speed increased by a factor of 1,000 during the last two decades
  - The ability of manufacturers to dissipate heat is limited, though
  - Looking back at the last five years, clock rates have been pretty much flat
- You could bank on materials science (MS) breakthroughs
  - MS engineers have usually delivered; can they keep doing it?
Pollack’s Rule: Trade-offs
[Figure: across CMOS process nodes from 1.5 µm down to 0.18 µm, each lead microarchitecture's area and performance improvement relative to a compaction of the previous one in the same technology.]

Pollack's Rule: "performance increase due to microarchitecture advances is roughly proportional to the square root of the increase in complexity."

Implications (in the same technology):
- A new microarchitecture consumes about 2-3x the die area of the previous one, but provides only 1.5-1.7x the performance
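As a rough arithmetic check (not from the slide): perf_new / perf_old ≈ sqrt(area_new / area_old), and sqrt(2) ≈ 1.41 while sqrt(3) ≈ 1.73, which matches the quoted 1.5-1.7x performance for a 2-3x area increase.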
Multi-core
- Put multiple CPUs on the same die
- Why is this better than multiple dies?
  - Smaller, cheaper
  - Closer together, so lower inter-processor latency
  - Can share an L2 cache (complicated)
  - Less power
- Costs of multi-core:
  - Complexity
  - Slower single-thread execution
Creating Parallel Processing Programs
- It is difficult to write SW that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases
- The first reason is that the parallel program must deliver better performance and efficiency than a sequential program on a uniprocessor; otherwise the extra effort is not worthwhile
- Consider the analogy of eight reporters trying to write a single story together in hopes of doing the work eight times faster
But … (Fortunately)
- With the rise of the Internet and rich multimedia applications, the need to handle independent tasks and huge data sets increased dramatically → task-level parallelism and data-level parallelism
- The user computing environment is changing to include many "background" tasks
- Multiprocessors can speed up these types of applications with the help of tighter integration of cores and multithreading
Multi-core vs. Manycore
- Multi-core: the current trajectory
  - Stay with the current fastest core design
  - Replicate every 18 months (2, 4, 8, ...)
  - Advantage: does not alienate serial workloads
  - Examples: AMD X2 (2 cores), Intel Core 2 Quad (4 cores), AMD Barcelona (4 cores)
- Manycore: converging in this direction
  - Simplify the cores (shorter pipelines, lower clock frequencies, in-order processing)
  - Start at 100s of cores and replicate every 18 months
  - Advantages: easier verification, defect tolerance, highest compute per surface area, best power efficiency
  - Examples: Cell SPE (8 cores), NVIDIA G80 (128 cores), Intel Polaris (80 cores), Cisco/Tensilica Metro (188 cores)
- Convergence: ultimately toward manycore
  - Manycore, if we can figure out how to program it!
  - Hedge: heterogeneous multi-core
Manycore System: CPU or GPU
- CPU: a large cache and sophisticated flow control minimize latency for arbitrary memory accesses in a serial process
- GPU: simple flow control and limited caches leave more transistors for parallel computation; high arithmetic intensity hides memory latency

[Figure: CPU die dominated by control logic and cache with a few ALUs, vs. GPU die dominated by many ALUs; each attaches to DRAM. Source: NVIDIA]
How Small is “Small”
- Power5 (server): 389 mm², 120 W @ 1900 MHz
- Intel Core 2 sc (laptop): 130 mm², 15 W @ 1000 MHz
- ARM Cortex-A8 (automobiles): 5 mm², 0.8 W @ 800 MHz
- Tensilica DP (cell phones / printers): 0.8 mm², 0.09 W @ 600 MHz
- Tensilica Xtensa (Cisco router): 0.32 mm² for 3 cores!, 0.05 W @ 600 MHz

[Figure: relative die sizes of Power5, Intel Core 2, ARM, Tensilica DP, and 3 Xtensa cores.]

Each small core operates at 1/3 to 1/10 the efficiency of the largest chip, but you can pack 100× more cores onto a chip and consume 1/20 the power.
More Concurrency: Design for Low Power
- Cubic power improvement with lower clock rate due to the V²f term: halving both voltage and frequency cuts dynamic power to roughly one-eighth
- Slower clock rates enable the use of simpler cores
- Simpler cores use less area (lower leakage) and reduce cost
- Tailor the design to the application to reduce waste

[Figure: the same die-size comparison (Power5, Intel Core 2, ARM, Tensilica DP, Xtensa ×3) as on the previous slide.]

This is how iPhones and MP3 players are designed to maximize battery life and minimize cost.
Tension between Concurrency and Power Efficiency
- Highly concurrent systems can be more power efficient
  - Dynamic power is proportional to V²fC
  - So build systems with even higher concurrency?
- However, many algorithms are as yet unable to exploit massive concurrency
- If higher concurrency cannot deliver a faster time to solution, the power-efficiency benefit is wasted
- So should we build fewer, faster processors instead?
Path to Power Efficiency: Reducing Waste in Computing
- Examine the methodology of the low-power embedded computing market, which is optimized for low power, low cost, and high computational efficiency

"Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste."
- Mark Horowitz, Stanford University & Rambus Inc.

- Sources of waste:
  - Wasted transistors (surface area)
  - Wasted computation (useless work, speculation, stalls)
  - Wasted bandwidth (data movement)
  - Designing for serial performance
What’s Next?
Source: Jack Dongarra, Intl. Supercomputing Conf. (ISC) 2008
[Figure: projected chip classes for different markets: all large cores (business), mixed large and small cores (home), all small cores (games/graphics), and many floating-point cores (scientific), plus 3D-stacked memory.]
The question is not whether this will happen but whether we are ready
Intel Single-chip Cloud Computer (Dec. 2009)
Parallelism: Introduction
Little’s Law
- Throughput (T) = number in flight (N) / latency (L)
- Example: with 4 floating-point registers and 8 cycles per floating-point op, Little's Law gives T = 4/8 = 1/2 issue per cycle

[Figure: operations flowing through Issue, Execution, and WB stages.]
Basic Performance Quantities
- Latency: the time every operation requires to execute (instruction, memory, or network latency)
- Bandwidth: the number of (parallel) operations completed per cycle (number of FPUs, DRAM channels, network links, etc.)
- Concurrency: the total number of operations in flight
- Little's Law relates the three: Concurrency = Latency × Bandwidth, or equivalently, Effective Throughput = Expressed Concurrency / Latency
- The available concurrency must be filled with parallel operations
- You cannot exceed peak throughput with superfluous concurrency
- Each channel has a maximum (limited) throughput
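A small sketch applying Concurrency = Latency × Bandwidth to a memory channel (the 100 ns latency and 10 GB/s bandwidth are assumed values for illustration):

    #include <stdio.h>

    int main(void) {
        double latency_s     = 100e-9;  /* assumed 100 ns memory latency */
        double bandwidth_bps = 10e9;    /* assumed 10 GB/s peak bandwidth */
        /* Little's Law: bytes that must be in flight to sustain peak bandwidth */
        double in_flight = latency_s * bandwidth_bps;
        printf("required concurrency: %.0f bytes in flight (~%.0f cache lines of 64 B)\n",
               in_flight, in_flight / 64.0);
        return 0;
    }

With these numbers, about 1,000 bytes (roughly 16 cache lines) must be outstanding at all times, which is why latency hiding requires many parallel memory operations.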
Performance Optimization: Contending Forces
- Contending forces: device efficiency vs. usage/traffic
- Often boils down to several key challenges:
  - Management of data/task locality
  - Management of data dependences
  - Management of communication
  - Management of variable and dynamic parallelism

[Figure: implementation and algorithmic optimization balances three goals: improve throughput, reduce the volume of data, and restructure to satisfy Little's Law.]
Classes of Parallelism and Parallel Architectures (1/2)
- There are basically two kinds of parallelism in applications:
  - Data-level parallelism (DLP): many data items can be operated on at the same time
  - Task-level parallelism (TLP): tasks of work are created that can operate independently and largely in parallel
Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2011
Classes of Parallelism and Parallel Architectures (2/2)
- Computer HW in turn can exploit these two kinds of application parallelism in four major ways:
  - Instruction-level parallelism: exploits DLP at modest levels with compiler help, using ideas like pipelining, and at medium levels using ideas like speculative execution
  - Vector architectures and GPUs: exploit DLP by applying a single instruction to a collection of data in parallel (SIMD)
  - Thread-level parallelism: exploits either DLP or TLP in a tightly coupled hardware model that allows interaction among parallel threads
  - Request-level parallelism: exploits parallelism among largely decoupled tasks specified by the programmer or the OS
Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2011
Uses of Parallelism
- "Horizontal" parallelism for throughput
  - More units working in parallel
- "Vertical" parallelism for latency hiding
  - Pipelining: keep units busy while waiting on resource, data, and control dependencies

[Figure: four units A-D operating side by side (throughput) vs. stages A-D of successive operations overlapped in time (latency hiding).]
Program Execution Time
- Latency metric: program execution time in seconds
- Your system architecture can affect all of its factors:
  - CPI (cycles per instruction): memory latency, IO latency, ...
  - CCT (clock cycle time, i.e., clock frequency): cache organization, power budget, ...
  - IC (instruction count): OS overhead, compiler choice, ...

    CPU time = Seconds / Program
             = (Cycles / Program) × (Seconds / Cycle)
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
             = IC × CPI × CCT

Are IC, CPI, and CCT independent? In practice they interact.
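A worked example with illustrative numbers: a program that executes IC = 2×10^9 instructions at CPI = 1.2 on a 2 GHz clock (CCT = 0.5 ns) takes 2×10^9 × 1.2 × 0.5 ns = 1.2 s.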
Architecture Methods for Performance Enhancement
- Powerful instructions
  - MD-technique: multiple data operands per operation: SIMD (vector, sub-word SIMD extensions)
  - MO-technique: multiple operations per instruction: sophisticated ISAs (e.g., CISC-like), VLIW
- Pipelining
- Multiple instruction issue
  - Single stream: superscalar
  - Multiple streams: multithreading, multi-core
Powerful Instructions – MD Technique
- MD-technique
  - Multiple data operands per operation
  - SIMD: Single Instruction, Multiple Data

Vector instruction example:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5 * b[i];

or simply c = a + 5*b. Assembly:

    Set   vl, 64       ; vector length = 64
    Ldv   v1, 0(r2)    ; load vector b
    Mulvi v2, v1, 5    ; v2 = 5 * b
    Ldv   v1, 0(r1)    ; load vector a
    Addv  v3, v1, v2   ; v3 = a + 5*b
    Stv   v3, 0(r3)    ; store vector c
Powerful Instructions – MD Technique
- SIMD computing
  - All PEs (processing elements) execute the same operation
  - Typical mesh or hypercube connectivity
  - Exploits the data locality of, e.g., image-processing applications
  - Dense encoding (few instruction bits needed)

[Figure: SIMD execution method: instructions 1 .. n are broadcast over time to PE1 .. PEn, which execute each instruction in lockstep.]
Powerful Instructions – MD Technique
- Sub-word parallelism
  - SIMD on a restricted scale for multimedia instructions
  - Short vectors added to existing microprocessor ISAs
  - Examples: Intel MMX/SSE/AVX, ARM NEON, AMD 3DNow!

[Figure: four sub-word multiplications performed in parallel within one wide register.]
Powerful Instructions – MO Technique
- MO-technique: multiple operations per instruction
- Two options:
  - CISC (Complex Instruction Set Computer)
  - VLIW (Very Long Instruction Word)

VLIW instruction example (one field per functional unit):

    FU1             FU2             FU3             FU4            FU5
    sub r8, r5, 3   and r1, r5, 12  mul r6, r5, r2  ld r3, 0(r5)   bnez r5, 13
Parallelism: Data-Level Parallelism
Recall: Flynn's Classification of Processor Architecture
- SISD: Single Instruction, Single Data stream
  - Uniprocessor
- SIMD: Single Instruction, Multiple Data streams
  - The same instruction is executed by multiple processing units
  - e.g., multimedia processors, vector architectures
- MISD: Multiple Instruction, Single Data stream
  - Successive functional units operate on the same stream of data
  - Rarely found in general-purpose commercial designs
- MIMD: Multiple Instruction, Multiple Data streams
  - Each processor has its own instruction and data streams
  - The most popular form of parallel processing
  - Single-user: high performance for one application
  - Multiprogrammed: running many tasks simultaneously (e.g., servers)
[Figure: SISD: a single PU fed by one instruction pool and one data pool. SIMD: one instruction pool driving multiple PUs, each on its own data stream.]
Data-level Parallelism
- Data parallelism focuses on distributing data across parallel computing nodes; it is usually found in:
  - Multimedia computing: identical operations on streams or arrays of sound samples, pixels, and video frames
  - Scientific computing: weather forecasting, car-crash simulation, biological modeling
DLP Kernels Dominate Many Computational Workloads

DLP and Throughput Computing
Source: Chuck Moore (AMD, 2011)
Data Parallelism & Loop Level Parallelism (LLP)
- Data parallelism: similar independent/parallel computations on different elements of arrays, which usually result in independent (parallel) loop iterations
- A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop: loop-level parallelism (LLP)
- Unrolling the loop, either statically by the compiler or dynamically by hardware, increases the size of the basic block
- The resulting larger basic block provides more instructions that the compiler/hardware can schedule or reorder to eliminate more stall cycles

Example (an unrolled version appears below):

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

Four vector instructions:

    LV   ; load vector X
    LV   ; load vector Y
    ADDV ; add vectors: X = X + Y
    SV   ; store vector X
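A sketch of static unrolling by a factor of 4 (the factor is illustrative; 1000 is divisible by 4, so no cleanup loop is needed):

    for (i = 1; i <= 1000; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }

The four statements in the body are mutually independent, so the compiler or hardware can schedule them in parallel instead of stalling once per iteration.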
Resurgence of DLP
- The convergence of application demands and technology constraints drives the architecture choice
- New applications such as graphics, machine vision, speech recognition, and machine learning all require large numerical computations that are often trivially data parallel
- SIMD-based architectures (vector SIMD, sub-word SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms
SIMD Classifications
- Vector architectures
- SIMD extensions (sub-word SIMD)
  - e.g., Intel MMX: Multimedia Extensions (1996), SSE: Streaming SIMD Extensions (1999), AVX: Advanced Vector Extensions (2010)
- Graphics Processing Units (GPUs)
Vector Architectures
- Basic idea:
  - Read sets of data elements into "vector registers"
  - Operate on those registers
  - Disperse the results back into memory
- Registers are controlled by the compiler
  - Register files act as compiler-controlled buffers
  - Used to hide memory latency and to leverage memory bandwidth
- Vector loads/stores are deeply pipelined
  - Pay the memory latency once per vector load/store!
- Regular loads/stores
  - Pay the memory latency for each element
[Figure: scalar vs. vector execution. Scalar (one operation): add r3, r1, r2 combines registers r1 and r2 into r3. Vector (N operations): vadd.vv v3, v1, v2 adds corresponding elements of vector registers v1 and v2 into v3 across the whole vector length.]
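A plain-C sketch of the strip-mining a compiler performs to fit an arbitrary-length loop to the hardware's maximum vector length (MVL = 64 is an assumed value; the inner loop stands in for one vector instruction executed with the vector length register set to vl):

    #include <stddef.h>

    #define MVL 64  /* assumed maximum hardware vector length */

    /* c[i] = a[i] + b[i] for arbitrary n, strip-mined into chunks of <= MVL */
    void vadd_stripmined(const double *a, const double *b, double *c, size_t n) {
        for (size_t i = 0; i < n; ) {
            size_t vl = (n - i < MVL) ? (n - i) : MVL;  /* set the VLR */
            for (size_t j = 0; j < vl; j++)             /* one "vector op" */
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }

Only the first (possibly short) chunk needs a vector length below MVL; every other chunk runs at the full hardware width.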
Vector Programming Model
[Figure: vector programming model. Scalar registers r0-r15 alongside vector registers v0-v15, each holding elements [0] .. [VLRMAX-1], with a vector length register (VLR). A vector arithmetic instruction, ADDV v3, v1, v2, adds elements [0] .. [VLR-1] of v1 and v2 lane by lane into v3. A vector load/store, LV v1, r1, r2, moves a vector between memory and register v1 using base address r1 and stride r2.]
Multiple Datapaths
- Vector elements are interleaved across lanes
  - Example: V[0, 4, 8, ...] on lane 1, V[1, 5, 9, ...] on lane 2, etc.
- Multiple elements are computed per cycle
  - Example: lane 1 computes on V[0] and V[4] in one cycle
- Modular, scalable design
  - No inter-lane communication is needed for most vector instructions
Vector Processors (I)
- A vector is a one-dimensional array of numbers
- Many scientific/commercial programs use vectors:

    for (i = 0; i <= 49; i++)
        C[i] = (A[i] + B[i]) / 2;

- A vector processor is one whose instructions operate on vectors rather than scalar (single-data) values
- Basic requirements:
  - Load/store whole vectors → vector registers (contain vectors)
  - Operate on vectors of different lengths → vector length register (VLEN)
  - Elements of a vector may be stored apart from each other in memory → vector stride register (VSTR); the stride is the distance between two elements of a vector
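For instance, walking a column of a row-major matrix is a strided access; the stride the VSTR would hold is the row width (a small illustrative sketch, not tied to any particular vector ISA):

    /* Row-major N x M matrix: element (i, j) lives at m[i*M + j].
       Reading one column j touches memory with a stride of M elements,
       which is exactly what the vector stride register expresses. */
    double column_sum(const double *m, int N, int M, int j) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += m[i * M + j];   /* consecutive accesses are M elements apart */
        return s;
    }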
Vector Processors (II)
- A vector instruction performs an operation on each element in consecutive cycles
  - Vector functional units are pipelined
  - Each pipeline stage operates on a different data element
- Vector instructions allow deeper pipelines
  - No intra-vector dependencies → no hardware interlocking needed within a vector
  - No control flow within a vector
  - A known stride allows vectors to be prefetched into cache/memory
Vector Processor Pros
- No dependencies within a vector
  - Pipelining and parallelization work well
  - Pipelines can be very deep: there are no dependencies!
- Each instruction generates a lot of work
  - Reduces instruction fetch bandwidth
- Highly regular memory access pattern
  - Interleaving across multiple banks gives higher memory bandwidth
  - Prefetching is straightforward
- No need to explicitly code loops
  - Fewer branches in the instruction sequence
Vector Processor Cons
- Still requires a traditional scalar unit (integer and FP) for the non-vector operations
- Difficult to maintain precise interrupts (cannot roll back all the individual operations already completed)
- The compiler or programmer has to vectorize programs
- Not very efficient for small vector sizes
- Not suitable/efficient for many different classes of applications
- Requires a specialized, high-bandwidth memory system
  - Usually built around heavily banked memory with data interleaving
Vector Processor Limitations
- The performance of a vector instruction depends on the length of the operand vectors
- Initiation rate:
  - The rate at which individual operations can start in a functional unit
  - For fully pipelined units this is one operation per cycle
- Start-up time (latency):
  - The time it takes to produce the first element of the result
  - Depends on how deep the functional-unit pipelines are; especially large for the load/store unit
Multimedia SIMD Extensions
- Key ideas:
  - Media applications operate on data types narrower than the native word size
    - Video and graphics systems use 8 bits per primary color
    - Audio samples use 8-16 bits
  - No memories associated with the ALUs; instead, a pool of relatively wide (64- to 256-bit) registers stores several narrower operands
    - e.g., a 256-bit adder: 16 simultaneous operations on 16-bit data, or 32 simultaneous operations on 8-bit data
  - No direct communication between ALUs; data moves via registers and special shuffle/permutation instructions
  - Not co-processors or supercomputers, but tightly integrated into the CPU pipeline
Multimedia SIMD Extensions
- Meant for programmers to utilize, not for compilers to generate
  - Recent x86 compilers are capable for FP-intensive apps, though
- Why is it popular?
  - Costs little to add to the standard arithmetic unit
  - Easy to implement
  - Needs smaller memory bandwidth than a vector architecture
  - Separate data transfers are aligned in memory
    - Vector: a single instruction may make 64 memory accesses, and a page fault in the middle of the vector is likely!
  - Uses much smaller register space, with fewer operands
  - No need for the sophisticated mechanisms of vector architectures
Multimedia Extensions (aka SIMD extensions)
- Very short vectors added to existing microprocessor ISAs
- Use an existing wide register split into smaller sub-word registers
  - The Lincoln Labs TX-2 from 1957 had a 36-bit datapath split into 2 × 18 bits or 4 × 9 bits
  - Newer designs have wider registers:
    - 128 bits for PowerPC AltiVec and Intel SSE2/3/4
    - 256 bits for Intel AVX
- A single instruction operates on all elements within the register (see the sketch below)

[Figure: a 64-bit register viewed as 2 × 32-bit, 4 × 16-bit, or 8 × 8-bit sub-words; four 16-bit adds execute in parallel.]
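As a concrete taste of sub-word SIMD, a minimal sketch using Intel SSE2 intrinsics, which perform eight 16-bit additions with one instruction (compile with SSE2 enabled, e.g., gcc -msse2):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* eight 16-bit lanes per 128-bit XMM register */
        short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        short c[8];

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_add_epi16(va, vb);   /* 8 parallel 16-bit adds */
        _mm_storeu_si128((__m128i *)c, vc);

        for (int i = 0; i < 8; i++)
            printf("%d ", c[i]);              /* 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }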
SIMD Multimedia Extensions like SSE-4
- At the core of multimedia extensions:
  - SIMD parallelism
  - Variable-sized data fields: vector length = register width / type size

[Figure: wide registers V0-V31 partitioned as sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands across one wide unit.]
Multimedia Extensions versus Vectors
- Limited instruction set:
  - No vector length control
  - No strided load/store or scatter/gather
  - Unit-stride loads must be aligned to a 64/128-bit boundary
- Limited vector register length:
  - Requires superscalar dispatch to keep the multiply/add/load units busy
  - Loop unrolling to hide latencies increases register pressure
- Trend toward fuller vector support in microprocessors:
  - Better support for misaligned memory accesses
  - Support for double precision (64-bit floating point)
  - The Intel AVX spec (announced April 2008): 256-bit vector registers, expandable up to 1024 bits
Parallelism: Instruction-Level Parallelism
ILP?
- Instruction-level parallelism (ILP) is a measure of how many operations in a computer program can be performed simultaneously
- The potential overlap among instructions is called instruction-level parallelism
- There are two approaches to exploiting ILP:
  - Dynamic: mainly hardware locates the parallelism → superscalar
  - Static: largely relies on software to locate the parallelism → VLIW (Very Long Instruction Word)
- How much ILP exists in programs is very application-specific
  - In certain fields, such as graphics and scientific computing, the amount can be very large
  - Workloads such as cryptography exhibit much less parallelism
ILP vs. PLP
- ILP (instruction-level parallelism): overlap individual machine operations (add, mul, load, ...) so that they execute in parallel
- PLP (processor-level parallelism): separate processors work on separate chunks of the program (the processors are programmed to do so)
Micro-architectural Techniques for ILP
- Instruction pipelining
- Superscalar or VLIW
  - Multiple execution units are used to execute multiple instructions in parallel
- Out-of-order execution
  - This technique is independent of both pipelining and superscalar issue
  - Register renaming is used to enable out-of-order execution
- Speculative execution
  - Execution of complete instructions, or parts of instructions, before it is certain that this execution should take place
  - Branch prediction is used together with speculative execution
Micro-architectural Techniques for ILP
- Modern processor techniques:
  - Deep pipelines
  - Superscalar issue
  - Out-of-order, speculative execution
  - Branch prediction
  - Register renaming, dataflow order
- Execution flow:
  - In-order, speculative fetch
  - Out-of-order execute
  - In-order commit, using a reorder buffer for precise exceptions

[Figure: pipeline front end (fetch unit with branch prediction and I-cache, instruction fetch buffer, decode/rename, dispatch) operating in order; reservation stations issuing to Int, Int, Float, Float, L/S, and L/S units with the D-cache out of order; reorder buffer and write buffer retiring in order.]
ILP (Parallel Instruction Execution) Constraints
- Structural dependences (resource contention)
- Code dependences (sequential semantics of the program):
  - Control dependences
  - Data dependences:
    - (RAW) true dependences
    - Storage conflicts (not present in in-order processors):
      - (WAR) anti-dependences
      - (WAW) output dependences
Types of Dependencies
- Structural dependence (structural hazard): the HW perspective
- Code dependence: the SW (program) perspective
  - Data dependence (data hazard):
    - True data dependence
    - Name dependences: output dependence, anti-dependence
  - Control dependence (control hazard)

Note: "hazard" is the hardware terminology; "dependence" is the software terminology.
Visualizing Pipelining
[Figure: four instructions, in order, each passing through Ifetch, Reg, ALU, DMem, Reg, overlapped across clock cycles 1-7.]
Pipelining
- Overlaps the execution of instructions by exploiting instruction-level parallelism
- Recall that CPU time (latency) = Seconds/Program = (Cycles/Program) × (Seconds/Cycle) = IC × CPI × CCT
- Pipelining became a universal technique in 1985
- Performance enhancement options:
  - Reduce the number of instructions per program (IC): given the ISA, this depends entirely on SW (compiler, programmer)
  - Reduce the number of cycles per instruction (CPI)
  - Reduce the number of seconds per cycle (CCT): CPI and CCT mostly depend on the HW organization and implementation technology under the system requirements
- Pipelining can reduce CCT and (effective) CPI
Pipelining is not quite that easy!
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: the HW cannot support this combination of instructions (a single person trying to fold and put away clothes at the same time)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (the missing sock)
  - Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)

Note: "hazard" is the hardware terminology; "dependence" is the software terminology.
Structural Hazards
- Occur when two or more different instructions want to use the same hardware resource in the same cycle
- e.g., MEM uses the same memory port as IF, as shown in the figure

[Figure: Load and Instr 1-4 flowing through the five-stage pipeline over cycles 1-7; the Load's DMem access lands in the same cycle as a later instruction's Ifetch, so both contend for the single memory port.]
Structural Hazards
- Structural hazards are reduced with these rules:
  - Each instruction uses a resource at most once
  - Always use the resource in the same pipeline stage
  - Use the resource for one cycle only
- ISAs are designed with this in mind
  - Sometimes it is very complex to achieve
  - Occurrence depends heavily on the program and the hardware resources
- Some common structural hazards:
  - Memory access conflicts
  - Floating point: since many floating-point instructions require many cycles, it is easy for them to interfere with each other
  - Starting up more instructions of one type than there are resources for
Data Hazards
- The use of the result of the ADD instruction in the next three instructions causes a hazard, since register r1 is not written until after those instructions read it:

    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11

[Figure: the five instructions in a five-stage pipeline (IF, ID/RF, EX, MEM, WB); r1 is written back by the ADD only in its WB stage, after the following instructions have already read their operands.]
Data Hazards
- Read After Write (RAW)
  - Caused by a true dependence: a need for communication
  - Instr J tries to read an operand before Instr I writes it:

      I: add r1, r2, r3
      J: sub r4, r1, r3

- Write After Read (WAR)
  - Caused by an anti-dependence and the re-use of the name "r1"
  - Instr J tries to write operand r1 before Instr I reads it:

      I: add r4, r1, r3
      J: add r1, r2, r3
      K: mul r6, r1, r7

- Write After Write (WAW)
  - Caused by an output dependence and the re-use of the name "r1"
  - Instr J tries to write operand r1 before Instr I writes it:

      I: sub r1, r4, r3
      J: add r1, r2, r3
      K: mul r6, r1, r7

- WAR and WAW happen in concurrent execution or out-of-order (OoO) pipelines

Solutions for data hazards:
- Stalling
- Forwarding: connect the new value directly to the next stage
- Speculation (with HW) or reordering (with the compiler and/or HW)
Control Hazards
- A control hazard occurs when we need to find the destination of a branch and cannot fetch any new instructions until we know that destination:

    10: beq r1, r3, 36
    14: and r2, r3, r5
    18: or  r6, r1, r7
    22: add r8, r1, r9
    ...
    36: xor r10, r1, r11

[Figure: the branch and the following instructions in the five-stage pipeline; the instructions after the beq are fetched before the branch outcome is known.]
Five Branch Hazard Alternatives
#1: Stall until the branch direction is clear.

#2: Predict branch not taken
- Execute successor instructions in sequence; "squash" instructions in the pipeline if the branch is actually taken
- Advantage of late pipeline state update
- 47% of MIPS branches are not taken on average
- PC+4 is already calculated, so use it to fetch the next instruction

#3: Predict branch taken
- 53% of MIPS branches are taken on average
- But the branch target address has not yet been calculated in MIPS, so MIPS still incurs a 1-cycle branch penalty
- Other machines: branch target known before the outcome

#4: Execute both paths.

#5: Delayed branch
- Define the branch to take place AFTER a following instruction:

    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target (if taken)

- A 1-slot delay allows a proper decision and branch-target-address calculation in a 5-stage pipeline
Pipelining
- Pipelined design: one stage per cycle, overlapping instructions
- Cost: pipeline registers
- To reduce stalls:
  - Forwarding paths for data dependences
  - Predict-not-taken branches for control dependences
  - Instruction and data caches to reduce memory stalls

[Figure: five-stage datapath: PC and I-cache (instruction fetch); decoder and register file (decode and read operands); ALU (execute); D-cache (memory access); write-back.]
Pipelining and ILP
- Higher clock frequency (lower CCT): deeper pipelines
  - Decompose pipeline stages into smaller stages to overlap more instructions
- Lower CPI_base: wider pipelines
  - Insert multiple instructions into the pipeline in parallel
- Lower CPI_stall:
  - Diversified pipelines for different functional units
  - Out-of-order execution
- Balance the conflicting goals:
  - Deeper and wider pipelines → more control hazards → branch prediction (speculation)
Deep Pipelining
- Idea: break the instruction up into N stages
  - Ideal CCT = 1/N of the non-pipelined CCT, so let's use a large N
- Other motivations for deep pipelines:
  - Not all basic operations have the same latency (integer ALU, FP ALU, cache access)
  - It is difficult to fit them all in one pipeline stage: CCT must be large enough to fit the longest one
  - So break some of them into multiple pipeline stages, e.g., data cache access in 2 stages, FP add in 2 stages, FP multiply in 3 stages

[Figure: an 8-stage pipeline: Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers.]
Limits to Pipeline Depth
- Each pipeline stage introduces some overhead O:
  - Delay of the pipeline registers
  - Inequalities in work per stage: work cannot be split into stages at arbitrary points
  - Clock skew: clocks to different registers may not be perfectly aligned
- If the original CCT was T, with N stages the CCT is T/N + O
- As N → ∞, speedup = T / (T/N + O) → T/O, assuming IC and CPI stay constant
- Eventually the overhead dominates, leading to diminishing returns
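A small sketch of the diminishing returns implied by speedup = T / (T/N + O); the values T = 10 ns of logic and O = 0.2 ns of per-stage overhead are assumptions for illustration:

    #include <stdio.h>

    int main(void) {
        double T = 10.0, O = 0.2;   /* assumed: ns of logic, ns overhead per stage */
        int depths[] = {1, 2, 5, 10, 20, 50};
        for (int i = 0; i < 6; i++) {
            int N = depths[i];
            /* pipelined cycle time, and speedup over the unpipelined design */
            double cct = T / N + O;
            printf("N = %2d  CCT = %5.2f ns  speedup = %5.2f (limit T/O = %.0f)\n",
                   N, cct, T / cct, T / O);
        }
        return 0;
    }

Going from 10 to 50 stages improves the speedup far less than going from 1 to 10, exactly because the fixed overhead O comes to dominate the cycle time.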
Pipelining Limits
- High clock frequency, but modest performance gains, due to memory latency and branch delays
- Power consumption increases dangerously!

[Figure: frequency and power trend from Pentium 3 to Pentium 4. Source: Grochowski, Intel, 1997]
Wide or Superscalar Pipelines
- Idea: operate on N instructions each cycle
  - Parallelism at the instruction level: CPI_base = 1/N
- Options (from simpler to harder):
  - One integer and one floating-point instruction
  - Any N = 2 instructions
  - Any N = 4 instructions
  - Any N = ? instructions
- What are the limits here?

[Figure: a pipeline (Fetch, Decode/Read Registers, ALU, Memory, Write Registers) processing multiple instructions per stage.]
Diversified Pipelines
- Idea: decouple the execution portion of the pipeline for different instruction types
  - Separate pipelines for simple integer, integer multiply, FP, and load/store
- Advantage: avoids unnecessary stalls
  - e.g., a slow FP instruction does not block independent integer instructions
- Disadvantages:
  - WAW hazards
  - Imprecise (out-of-order) exceptions

[Figure: a common front end (Fetch, Decode/Read Registers) fans out to execution pipelines of different depths (single-stage integer add, multi-stage integer multiply, four-stage FPU, multi-stage memory) that merge again at Write Registers.]
ILP Architectures
- A computer architecture is a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs written for the architecture and the set of processor implementations of that architecture
- An ILP architecture adds to this contract information embedded in the program about the available parallelism between the instructions and operations in the program
Sequential Architecture and Superscalar Processors
- The program contains no explicit information about the dependences that exist between instructions
- Dependences between instructions must be determined by the hardware
  - It is only necessary to check dependences against sequentially preceding instructions that have been issued but not yet completed
- The compiler may reorder instructions to facilitate the hardware's task of extracting parallelism
Scalar, Superscalar, Deep pipeline
- Scalar processor: one instruction passes through in each cycle
- Superscalar processor: more than one instruction passes through in each cycle
  - For an m-way superscalar, the effective CPI is 1/m that of the scalar pipeline

[Figure: a 3-way pipelined superscalar.]
Superscalar Performance
- Performance spectrum?
  - If all instructions were dependent: no speedup; superscalar buys us nothing
  - If all instructions were independent: speedup = N, where N is the degree of superscalarity
- Again, the key is typical program behavior: some parallelism exists
Simplified View of an OoO Superscalar Processor
[Figure: simplified OoO superscalar pipeline. In-order front end: fetch unit (with branch prediction and I-cache), instruction fetch buffer, then decode/rename, which reads registers or assigns register tags and advances instructions through dispatch to the reservation stations, bounded by the issue width. Out-of-order execution: reservation stations monitor register tags, receive forwarded data, and issue an instruction once all of its operands are ready, to Int, Float, and L/S units backed by the D-cache. In-order commit: the reorder buffer retires instructions in program order through the write buffer.]
Independence Architecture and VLIW Processors
- By knowing which operations are independent, the hardware needs no further checking to determine which instructions can issue in the same cycle
- The set of independent operations is much larger than the set of dependent operations
  - Only a subset of the independent operations is specified
- The compiler may additionally specify on which functional unit and in which cycle an operation executes
- The hardware needs to make no run-time decisions
VLIW Processors
- Operation vs. instruction:
  - Operation: a unit of computation (add, load, branch; what a sequential architecture calls an instruction)
  - Instruction: a set of operations intended to issue simultaneously
- The compiler decides which operations go into each instruction (scheduling)
- All operations that are supposed to begin at the same time are packaged into a single VLIW instruction

[Figure: two VLIW instructions, each fanning three operations through IF, ID, EX, M, WB in lockstep.]
VLIW: Very Long Instruction Word
- The compiler schedules parallel execution
- Multiple parallel operations are packed into one long instruction word
- The compiler must avoid data hazards (there are no interlocks)

Example instruction format:

    | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |

with two single-cycle-latency integer units, two three-cycle-latency load/store units, and two four-cycle-latency floating-point units.
VLIW Strengths
- The hardware is very simple: a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
- More silicon goes to actual processing (rather than being spent on branch prediction, for example)
- It should run fast, as the only limit is the latency of the functional units themselves
- Programming a VLIW chip is very much like writing microcode
VLIW Limitations
- The need for a powerful compiler
- Increased code size arising from aggressive scheduling policies
- Larger memory bandwidth and register-file bandwidth
- Limitations due to binary compatibility across implementations
VLIW past & future
- Decline of VLIWs for general-purpose systems:
  - They could not be integrated in a single chip
  - Binary compatibility between implementations
- Rediscovery of VLIW in embedded systems:
  - No more integrability issues
  - Binary incompatibility is not relevant (for a DSP, unlike a CPU)
  - Advantages of VLIW: simplified hardware; the architecture can be optimized ad hoc to achieve ILP
Summary: Superscalar vs. VLIW
- Additional info required in the program: superscalar: none; VLIW: minimally, a partial list of independences; at most, a complete specification of when and where each operation is to be executed
- Dependence analysis: superscalar: performed by HW; VLIW: performed by compiler
- Independence analysis: superscalar: performed by HW; VLIW: performed by compiler
- Scheduling: superscalar: performed by HW; VLIW: performed by compiler
- Role of compiler: superscalar: rearranges the code to make the HW's analysis and scheduling more successful; VLIW: replaces virtually all of the analysis and scheduling HW
ILP Open Problems
- Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions; two simple types of pipelining (structural and functional)
- Controller cost: most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling
- Area constraints: resource-constrained algorithms could have better interaction between scheduling and floorplanning
- Realism:
  - Scheduling realistic design descriptions that contain several special language constructs
  - Using more realistic libraries and cost functions
  - Expanding scheduling algorithms to incorporate different target architectures
Summary: Limits to ILP
- Doubling issue rates above today's 3-6 instructions per clock probably requires a processor to:
  - Issue 3-4 data-memory accesses per cycle
  - Resolve 2-3 branches per cycle
  - Rename and access over 20 registers per cycle
  - Fetch 12-24 instructions per cycle
- The complexity of implementing these capabilities is likely to mean sacrifices in maximum clock rate
  - The widest-issue processors tend to be the slowest in terms of clock rate
  - Also consider the ROI in terms of area and power
Summary: Limits to ILP (cont’d)
- Most ways to increase performance also boost power consumption
- The key question is energy efficiency: does a method increase power consumption faster than it boosts performance?
- Multiple-issue techniques are energy inefficient:
  - They incur logic overhead that grows faster than the issue rate
  - The number of switching transistors scales with the peak issue rate, while performance scales with the sustained rate; the growing gap between peak and sustained performance means increasing energy per unit of performance
Evolved Solution or Alternatives
- MT (multithreaded) approach: more tightly coupled than MP
  - Decentralized multithreaded architectures: hardware for inter-thread synchronization and communication; Multiscalar (U. of Wisconsin), Superthreading (U. of Minnesota)
  - Centralized multithreaded architectures: share pipelines among multiple threads; TERA, SMT (throughput-oriented); Trace Processor, DMT (performance-oriented)
- MP (multiprocessor) approach: decentralize all resources; multiprocessing on a single chip
  - Communicate through shared memory: Stanford Hydra
  - Communicate through messages: MIT RAW