Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Energy-EfficientParallel Computer Architecture
Christopher Batten
Computer Systems LaboratorySchool of Electrical and Computer Engineering
Cornell University
Fall 2014
• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns
Motivating Trends in Computer Architecture
Transistors(Thousands)
Frequency(MHz)
TypicalPower (W)
MIPSR2K
IntelP4
DECAlpha 21264
Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten1975 1980 1985 1990 1995 2000 2005 2010 2015
100
101
102
103
104
105
106
SPECintPerformance
107
Numberof Cores
Intel 48-CorePrototype
AMD 4-CoreOpteron
Parallelism?
? Trend 1Power & energy
constrain allsystems
Trend 2Transition to
multicoreprocessors
Trend 3Inevitable endof Moore’s law
Cornell University Christopher Batten 2 / 19
• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns
Performance (Tasks per Second)
En
erg
y E
ffic
ien
cy (
Ta
sks p
er
Joule
)
SimpleProcessor
Design PowerConstraint
High-PerformanceArchitectures
EmbeddedArchitectures
DesignPerformanceConstraint
Flexibility vs
. Spe
cializat
ion
CustomASIC
Less FlexibleAccelerator
More FlexibleAccelerator
Cornell University Christopher Batten 3 / 19
• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns
Fine-Grain Heterogeneous Architectures
L2 Cache Bank
On-Chip Router
L2 Cache Bank
On-Chip Router
Latency-Optimized Tile: few large out-of-order cores
Data Memory
Instruction Memory
CPLD
ST ST
L2 Cache Bank
On-Chip Router
L2 Cache Bank
On-Chip Router
L1I$
GPPL1 I$
L1 D$
OoO GPP
L1 I$
L1 D$
OoO GPPL1D$
L1I$
GPP
L1D$
L1I$
GPP
L1D$
L1I$
GPP
L1D$
L1I$
GPP
L1D$
L1I$
GPP
L1D$
μT0
μT4
μT1
μT5
L1 I$
CP SIMT Issue Unit
SIMT Memory Unit
L1 D$
μT2
μT6
μT3
μT7
Throughput-Optimized Tile: many small in-order cores
Data-Parallel-Optimized Tile:flexible data-parallel accelerator
Application-Optimized Tile:instruction set specialization;flexible algorithm ordata-structure accelerators
Cornell University Christopher Batten 4 / 19
• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns
Projects Within the Batten Research Group
GPGPUMicroarchitecture
XLOOPS: Specializationfor Loop Patterns
XPC: Explicit-Parallel-Call Architectures
instrA
instrB
pcall n, func1
...
psync func1:
instrC
pcall func2
...
pret
CtrlProc
L1 Data Cache
SIMT Lanes
SIMT Issue Unit
SIMT Mem Unit
Polymorphic HardwareSpecialization
L1I$
PolyASU
PolyASU
PolyDSU
PolyDSU
PolyDSU
Control Crossbar
Memory Crossbar
CP
L1 Data Cache
Reconfigurable PowerDistribution Networks
Control
Complexity EffectiveOn-Chip Networks
Hardware Accelerationfor Dynamic Languages
Scheme Python
JIT Compiler
Python-BasedHardware Modeling
class Reg (Model):
def __init__( s ):
s.in = InPort(32)
s.out = OutPort(32)
def elaborate( s ):
@s.posedge_clk
def seq_logic():
s.out.next = s.in
Cornell University Christopher Batten 5 / 19
• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns
Vertically Integrated Research Methodology
Our research involves reconsidering all aspects of the computing stackincluding applications, programming frameworks, compiler optimizations,runtime systems, instruction set design, microarchitecture design, VLSI
implementation, and hardware design methodologies
CrossCompiler
FunctionalSimulator
Binary
C++ Applications
C++ FunctionalSpecification
Cycle-LevelSimulator
C++ Cycle-LevelModel
Layout
Verilog RTL Model
RTLSimulator
Gate-Level Model
Gate-LevelSimulator
Switching Activity
PowerAnalysis
SynthesisPlace&Route
Key Metrics: Cycle Count, Cycle Time, Area, Energy
Experimenting with full-chiplayout, FPGA prototypes, andtest chips is a key part of our
research methodology
Cornell University Christopher Batten 6 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS: Architectural Specialization forInter-Iteration Loop Dependence Patterns
Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu,Zhiru Zhang, and Christopher Batten
47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO)Cambridge, UK, Dec. 2014
Cornell University Christopher Batten 7 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
Inter-Iteration Loop Dependence Patterns
inst0inst1inst2inst3...branch
Iteration0
inst0inst1inst2inst3...branch
Iteration1
inst0inst1inst2inst3...branch
Iteration2
inst0inst1inst2inst3...branch
Iteration3
inst0inst1inst2inst3...branch
Iterationn-1
Explicit loop specialization (XLOOPS) elegantly encodes inter-iterationloop dependence patterns in the instruction set and enables
traditional, specialized, and adaptive execution.
Cornell University Christopher Batten 8 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS ISA: Unordered Concurrent
for ( i=0; i<N; i++ )
C[i] = A[i] * B[i]
loop:
lw r2, 0(rA)
lw r3, 0(rB)
mul r4, r2, r3
sw r4, 0(rC)
addiu.xi rA, 4
addiu.xi rB, 4
addiu.xi rC, 4
addiu r1, r1, 1
xloop.uc r1, rN, loop
Iteration 0
inst0inst1inst2inst3...xloop.uc
Iteration 1
inst0inst1inst2inst3...xloop.uc
Iteration 2
inst0inst1inst2inst3...xloop.uc
Iteration 3
inst0inst1inst2inst3...xloop.uc
Cornell University Christopher Batten 9 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS ISA: Ordered-Through-Registers
for ( X=0, i=0; i<N; i++ )
X += A[i]
B[i] = X
loop:
lw r2, 0(rA)
addu rX, r2, rX
sw rX, 0(rB)
addiu.xi rA, 4
addiu.xi rB, 4
addiu r1, r1, 1
xloop.or r1, rN, loop
Iteration 0
inst0inst1inst2inst3...xloop.or
Iteration 1
inst0inst1inst2inst3...xloop.or
Iteration 2
inst0inst1inst2inst3...xloop.or
Iteration 3
inst0inst1inst2inst3...xloop.or
Cornell University Christopher Batten 10 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS ISA: Ordered-Through-Memory
for ( i=K; i<N; i++ )
A[i] = A[i] * A[i-K]
# r1 = rK
# r3 = rA + 4*rK
loop:
lw r4, 0(r3)
lw r5, 0(rA)
mul r6, r4, r5
sw r6, 0(r3)
addiu.xi r3, 4
addiu.xi rA, 4
addiu r1, r1, 1
xloop.om r1, rN, loop
Iteration 0
inst0inst1inst2inst3...xloop.om
Iteration 1
inst0inst1inst2inst3...xloop.om
Iteration 2
inst0inst1inst2inst3...xloop.om
Iteration 3
inst0inst1inst2inst3...xloop.om
Cornell University Christopher Batten 11 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS ISA: Unordered Atomic
for ( i=0; i<N; i++ )
B[ A[i] ]++
D[ C[i] ]++
loop:
lw r6, 0(rA)
lw r7, 0(r6)
addiu r7, r7, 1
sw r7, 0(r6)
addiu.xi rA, rA, 4
...
addiu r1, r1, 1
xloop.ua r1, rN, loop
Iteration 0
inst0inst1inst2inst3...xloop.ua
Iteration 1
inst0inst1inst2inst3...xloop.ua
Iteration 2
inst0inst1inst2inst3...xloop.ua
Iteration 3
inst0inst1inst2inst3...xloop.ua
Cornell University Christopher Batten 12 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS ISA: Dynamic Bound
Iteration 0
inst0inst1inst2inst3...xloop.db
Iteration 1
inst0inst1inst2inst3...xloop.db
Iteration 2
inst0inst1inst2inst3...xloop.db
Iteration 3
inst0inst1inst2inst3...xloop.db
Iteration 4
inst0inst1inst2inst3...xloop.db
Iteration 5
inst0inst1inst2inst3...xloop.db
Cornell University Christopher Batten 13 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS Compiler
Kernel implementing Floyd-Warshall shortest path algorithm
for ( int k = 0; k < n; k++ )
#pragma xloops ordered
for ( int i = 0; i < n; i++ )
#pragma xloops unordered
for ( int j = 0; j < n; j++ )
path[i][j] = min( path[i][j], path[i][k] + path[k][j] );
Kernel implementing greedy algorithm for maximal matching
#pragma xloops ordered
for ( int i = 0; i < num_edges; i++ )
int v = edges[i].v; int u = edges[i].u;
if ( vertices[v] < 0 && vertices[u] < 0 )
vertices[v] = u; vertices[u] = v; out[k++] = i;
Cornell University Christopher Batten 14 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS Microarchitecture
GPR RF32 × 32b
2r2w
GPP
LLFU
D$ Request/Response Crossbar
L1 I$ 16 KB
L2 Request and Response Crossbars
L1 D$ 16 KB
SLFU
Lane3
Lane1
Lane RF24 × 32b
2r2w
Inst Buf128×
Lane RF24 × 32b
2r2w
Inst Buf128×
Lane RF24 × 32b
2r2w
Inst Buf128×
Lane0
SLFU SLFU SLFU
IDQ
Lane Management Unit
IDQ IDQ
CIB 8×CIB 8×CIB 8×
LSQ16×
LSQ16×
LSQ16×
DBN
Lane Management Unit I Elegant hardware/softwareabstraction enables efficienttraditional execution ongeneral-purpose processor
I Loop-pattern specialization unitenables specialized execution forimproved performance andefficiency
I Adaptive execution uses twoprofiling phases to adaptivelydetermine best location to runloop
Cornell University Christopher Batten 15 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS Cycle-Level Performance Results
rgb2
cmyk
-uc
sgem
m-u
c
ssea
rch-
uc
sym
m-u
c
vite
rbi-u
c
war
-uc
adpc
m-o
r
cova
r-or
dith
er-o
r
kmea
ns-o
r
sha-
or
sym
m-o
r
dynp
rog-
om
knn-
om
ksac
k-sm
-om
ksac
k-lg-o
m
war
-om
mm
-orm
sten
cil-o
rm
btre
e-ua
hsor
t-ua
huffm
an-u
a
rsor
t-ua
bfs-
uc-d
b
qsor
t-uc-
db
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Sp
ee
du
p N
orm
aliz
ed
to
S
imp
le R
ISC
Co
re
OOO 2-Way OOO 4-Way OOO 2-Way with LPSU
xloop.uc xloop.or xloop.{om,orm} xloop.ua xloop.*.db
I XLOOPS vs. Simple Core : Always higher performanceI XLOOPS vs. OOO 2-way : Generally competitive or higher performanceI XLOOPS vs. OOO 4-way : Mixed results
Cornell University Christopher Batten 16 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
XLOOPS Cycle-Level Energy-Efficiency Results
XLOOPS vs.Simple RISC Core
XLOOPS vs.OOO 2-Way
XLOOPS vs.OOO 4-Way
0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0Normalized Performance
0.5
1.0
1.5
2.0
2.5
3.0
3.5
Norm
aliz
ed E
nerg
y E
ffic
iency
0.5 1.0 1.5 2.0 2.5 3.0Normalized Performance
0.5 1.0 1.5 2.0 2.5Normalized Performance
I XLOOPS vs. Simple Core : Competitive energy efficiencyI XLOOPS vs. OOO 2-way : Always more energy efficientI XLOOPS vs. OOO 4-way : Always more energy efficient
Cornell University Christopher Batten 17 / 19
Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •
DCache16KB SRAM for Cache Lines
DCacheTags
ICacheTags
ICache16KB SRAM for Cache Lines
L0Instr
Buffer
L0Instr
Buffer
L0Instr
Buffer
L0Instr
Buffer
Loop PatternSpecialization Unit
ScalarProcessor
32b IEEEFloating Point Unit
32b IntegerMul/Div Unit
VLSIImplementation
I TSMC 40 nmstandard-cell-basedimplementation
I RISC scalarprocessor with4-lane LPSU
I Supports xloop.uc
I Implemented inTSMC 40 nm usingASIC CAD toolflow
I ≈40% extra areacompared to simpleRISC processor
Cornell University Christopher Batten 18 / 19
Batten Research Group Overview XLOOPS: Specialization for Loop Patterns
Take-Away Points
I We envision future computing systems usingfine-grain heterogeneous architectures with a mixof tiles optimized for latency, throughput, classesof applications, or even a specific application
I We are using a vertically integrated researchmethodology to explore a variety of projects thatwill hopefully enable such fine-grainheterogeneous architectures
Cornell University Christopher Batten 19 / 19