1
Energy Efficient Computing through Compiler Assisted Dynamic Specialization
Venkatraman Govindaraju
Advisor: Karthikeyan Sankaralingam
(Defense: 7/29/2014)
2
Why energy efficiency?
[Graph: transistors (in 100K), power (W), performance (GOPS), and efficiency (GOPS/W), 1985–2010]
Moore’s Law is still valid
Power is limited by heat
Performance stagnates because of diminishing returns
Simplified Graph from “The Free Lunch Is Over”. Herb Sutter. In DDJ, March 2005
We must improve energy efficiency to scale performance
FabScalar OpenSPARC
Where is energy consumed?
Actual execution consumes only a fraction of energy
3
Reduce overhead energy to improve overall energy efficiency
Data is from “Power balanced pipelines” Sartori et al. in HPCA 2012
4
How to get efficiency?
Use accelerators or specialization
[Chart: efficiency vs. generality/compiler effectiveness, placing general purpose processors (GPP), SIMD, and ASICs]
Flexible as a GPP, but with ASIC efficiency?
5
DySER: Compiler Assisted Hardware Specialization
Efficiency: use specialized hardware for hot regions
Generality: reconfigurable at run-time, using encodings generated at compile-time
Design complexity: decoupled access/execute; the original core runs uncommon tasks
6
Evolution of DySER
[Chart: efficiency vs. compiler effectiveness, tracing general purpose processor → SIMD (SSE) → DySER → DySER + DLP → DySER + DLP + Slicer → ASIC]
Dynamically specialized datapath, DSL programming [HPCA 2011]
Exploits DLP and vectorization for high efficiency, DSL programming [IEEE Micro 2012]
AEPDG, a new IR to model DySER; auto-compiles directly from C/C++ to DySER [PACT 2013]
7
What’s New? (Preliminary Exam 8/12 → Defense 7/14)
Architecture: Basic DySER ISA, Vector DySER ISA → Vector DySER ISA, ISA for irregular workloads
Compiler: preliminary design, partial implementation → complete design, source code released
Evaluation: high-level pipeline models (SPEC INT, PARSEC) → accurate simulator models (SPEC INT, PARSEC, throughput kernels, Parboil, database)
Publications: Architecture (HPCA 2011), Prototype (HPCA 2012) → DySER+DLP (IEEE Micro 2012), Compiler (PACT 2013), Integration (HotChips 2012), Modeling (Micro 2014, in submission)
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
DySER Overview
9
DySER
• Circuit-switched array of functional units
• Integrated into the processor pipeline
• Dynamically creates specialized datapaths
[Diagram: five-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack) with I$, D$, register file, and DySER attached alongside the execute units]
DySER Datapath
10
DySER Configuration
Uses the same network for configuration bits
Configure once – reuse many times
DySER Execution Model: Decoupled Access/Execute
Memory access instructions execute in the processor pipeline: address calculation, loads and stores, DySER configuration, sending data to DySER, receiving data from DySER, and loop control
Computation executes in DySER
[Diagram: the processor instruction stream with its computation (×, −, +) offloaded into a DySER configuration; the loop back-edge (JMP LOOP) stays in the processor]
Execution Example
13
[Diagram: DySER array of functional units (FU) and switches (S), input/output FIFOs, input ports IP0–IP3, output ports OP0–OP3, and the configuration path]
DySER program:

//Vector Dot Product
DyINIT(0xABCD)
DyINIT(0xEF00)
...
SUM = [0,0];
for (int i = 0; i < LEN; i += 2) {
  DySend_Vec(SUM, IP0);
  DyLoad_Vec(a[i:i+1], IP1);
  DyLoad_Vec(b[i:i+1], IP2);
  DyRecv_Vec(OP2, SUM);
}
sum = accum(SUM); //(last iteration here)
return sum;
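The DyINIT/DySend/DyLoad intrinsics above are slide pseudocode. As a plain C++ sketch (function and variable names here are illustrative, not part of the DySER ISA), the computation the configured datapath performs is:

```cpp
#include <cstddef>

// Plain C++ equivalent of the dot product the DySER program computes.
// Two partial sums mirror the two-wide vector interface.
float dot_product(const float* a, const float* b, std::size_t len) {
    float sum[2] = {0.0f, 0.0f};           // SUM = [0,0] in the slide code
    std::size_t i = 0;
    for (; i + 1 < len; i += 2) {          // two lanes per invocation
        sum[0] += a[i]     * b[i];
        sum[1] += a[i + 1] * b[i + 1];
    }
    float total = sum[0] + sum[1];         // sum = accum(SUM)
    for (; i < len; ++i)                   // peeled last iteration
        total += a[i] * b[i];
    return total;
}
```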
Execution Example
14
[Diagram: the dot-product configuration active on the array — a × and a + unit mapped, fed from the input FIFOs; same DySER program as above]
15
Why does it work?
Applications execute in phases
Applications follow the 90-10 rule: 10% of code regions contribute 90% of run time
Specializing such code regions amortizes the overheads
Where does performance come from?
Removing instructions from the main pipeline: less use of the instruction queue, ROB, and register file; effectively a larger instruction window
Decoupled execution: concurrency between the main processor and DySER; many FUs -> high potential ILP
Benefits of vectorization: fewer memory access instructions; explicit pipelining of DySER
16
17
Energy Savings?
Eliminates per-instruction overheads: no fetch, decode, etc.; no expensive register file reads
High performance itself leads to energy savings; no additional power-hungry structures
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
19
Compiler Intermediate Representation
Makes it easier to optimize for the target architecture
A suitable IR should: model the architecture, accurately if possible; capture the dependencies between operations; make code generation for the architecture easy
DySER Architecture: Configurable Datapath
20
Configure switches and functional units to create different datapaths
Can specialize the datapath for ILP or for DLP
Allows acceleration of a variety of computation patterns
[Diagrams: three example configurations of the switch/FU array — multiply-accumulate, sum of absolute differences, and 3×3 convolution]
21
Compiler IR for DySER: Modeling Configurable Datapath
Graph based: nodes represent the operations/instructions; edges represent dependences between the operations
Easier to map computation to DySER
for (i = 0; i < N; ++i) C[i] += A[i] * B[i]
[Diagram: the loop body as a dataflow graph — two loads feeding ×, then +2 and a store — and its mapping onto the array via input ports in1, in2 and output port out]
DySER Architecture: Control Flow Mapping
22
S S
S
S S
S
S S
>S
+S
S
-S
S S S
φS
Predication: predicates the output; a metabit in the datapath propagates the validity of the data
“Select” functional unit (PHI functions): selects the valid input and forwards it as its output
Native control flow mapping allows accelerating code with arbitrary control flow
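The predication scheme above can be sketched in plain C++. The validity-metabit semantics here are a simplified assumption for illustration, not the hardware definition:

```cpp
// Each value carries a validity metabit; a PHI node forwards whichever
// input arrives valid (exactly one side of an if/else is valid).
struct PredValue { float data; bool valid; };

PredValue phi(PredValue in0, PredValue in1) {
    return in0.valid ? in0 : in1;
}

// if (b < 0) a = b + 5; else a = b - 5;  mapped to predicated dataflow:
float select_example(float b) {
    bool pred = (b < 0);
    PredValue t = { b + 5, pred };     // "then" side valid iff pred
    PredValue f = { b - 5, !pred };    // "else" side valid iff !pred
    return phi(t, f).data;
}
```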
23
Compiler IR for DySER: Modeling Control Flow mapping
Special edges to represent control dependence
Special node to model PHI instruction
for (i = 0; i < N; ++i):
    if b[i] < 0:
        a = b[i] + 5
    else:
        a = b[i] - 5
    b[i] = a
[Diagram: the loop’s PDG with +, −, <, φ, load, store, and address (b+i) nodes; control dependence edges feed the φ node]
DySER Architecture: Decoupled Access/Execute Execution
[Diagram: the DySER array with input/output FIFOs, input ports IP0–IP3, and output ports OP0–OP3]
Processor sends data to DySER through its input FIFOs (input ports)
DySER computes in data flow fashion
Processor receives data from DySER through its output FIFOs (output ports)
Allows DySER to consume data in a different order than it is stored
25
Compiler IR for DySER: Modeling Decoupled Access/Execute Execution
Explicitly partitioned into Access and Execute PDG
[Diagram: the same loop partitioned into an access-PDG (loads, stores, address computation b+i) and an execute-PDG (+, −, <, φ), exchanging the values b[i+0], b[i+1]]
DySER Architecture: Flexible Vector Interface
struct vec { float x, y, z; float q; };
vec A[], B[];
float *a = A, *b = B;
float dot[];
for (int i = 0; i < LEN; i += 1) {
  dot[i] = A[i].x*B[i].x + A[i].y*B[i].y + A[i].z*B[i].z;
}
26
[Diagram: the dot-product dataflow — three multipliers and two adders over a[i], a[i+1], a[i+2] and the b[] inputs — mapped onto the array]
DySER Architecture: Flexible Vector Interface
[Diagram: across iterations 1 and 2, the three multipliers consume a[0],a[4]; a[1],a[5]; a[2],a[6] — how do we get this access pattern? Ports shown only for a[]]
DySER Architecture: Flexible Vector Interface
A flexible mechanism to map contiguous inputs to arbitrary DySER I/Os:
Add a “vector port” before the FIFOs
Add a “vector map” that tells how data should be transferred
A state machine processes the data as it arrives
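As a rough C++ sketch of the vector-map idea (the map encoding below is an illustrative data structure, not the hardware format): each entry routes one contiguous word to a DySER input port, or skips it:

```cpp
#include <vector>
#include <cstddef>

// Entry k of port_map says which input port receives the k-th contiguous
// word, or -1 if that word is skipped (the 'x' slots on the slide).
std::vector<std::vector<float>>
apply_port_map(const float* data, std::size_t n,
               const std::vector<int>& port_map, int num_ports) {
    std::vector<std::vector<float>> ports(num_ports);
    for (std::size_t k = 0; k < n; ++k) {
        int p = port_map[k % port_map.size()];   // map repeats per group
        if (p >= 0) ports[p].push_back(data[k]); // strided delivery
    }
    return ports;
}
```

For the struct {x, y, z, q} example, the map {0, 1, 2, -1} delivers x, y, z to ports P0–P2 and skips the padding word q.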
[Diagram: a vector port map routing contiguous words 0–7 to ports P0, P1, P2 and skipping the unused word of each group, feeding the input FIFO]
29
DySER Architecture: Flexible Vector Interface
[Diagram: the vector port mapping routes words 0–7 of a memory/vector register through IP0–IP3 into the array]
“Vector port mapping” allows accelerating code regions with different memory access patterns (e.g., strided)
Compiler IR for DySER: Modeling Flexible Vector Interface
30
[Diagram: original AEPDG → unrolled AEPDG → vector map generation (load/store coalescing); the vector port map for a[] is filled in, port by port, as loads are coalesced]
Each edge on the interface knows its order
Compiler IR for DySER: Modeling Flexible Vector Interface
31
[Diagram: the same flow, with the unrolled AEPDG coalesced into vectorized accesses a[i:i+5] and out[i:i+1]]
32
Compiler IR: Access Execute Program Dependence Graph (AEPDG)
A variant of the PDG
Nodes represent operations
Edges represent both data and control dependences
Explicitly partitioned into access-PDG and execute-PDG subgraphs
Edges between the access- and execute-PDGs are augmented with temporal information
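The access/execute partitioning rule can be sketched as a small graph traversal. This is a toy model, not the Slicer implementation: seed with memory operations, then pull in the nodes they depend on for addressing:

```cpp
#include <vector>
#include <string>

// Toy AEPDG node: an opcode plus the nodes it uses for address computation.
struct Node {
    std::string op;              // "ld", "st", "add", "mul", ...
    std::vector<int> addr_deps;  // nodes feeding an address
};

// Mark the access-PDG: loads/stores and their backward address slices.
// Everything left unmarked belongs to the execute-PDG.
std::vector<bool> mark_access(const std::vector<Node>& g) {
    std::vector<bool> access(g.size(), false);
    std::vector<int> work;
    for (int i = 0; i < (int)g.size(); ++i)
        if (g[i].op == "ld" || g[i].op == "st") {
            access[i] = true;
            work.push_back(i);
        }
    while (!work.empty()) {                  // follow address dependences
        int n = work.back(); work.pop_back();
        for (int d : g[n].addr_deps)
            if (!access[d]) { access[d] = true; work.push_back(d); }
    }
    return access;
}
```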
[Diagram: the example AEPDG — access nodes (load, store, address b+i) and execute nodes (+, −, <, φ)]
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
Compilation tasks:
Identify code regions/loops to specialize
Construct the AEPDG (access PDG, execute PDG)
Perform vectorization/optimizations
Schedule: execute PDG to DySER, access PDG to the core
[Flowchart: Application → Region Identification → Construct AEPDG → Vectorization/Optimization → Scheduling → Access PDG on the core, Execute PDG on DySER]
Region identification: identify code regions to specialize
Path profiling; utilize loops
Need single-entry/single-exit regions
Specialization region
Construct AEPDG
Build the program dependence graph
Separate memory access from computation: loads/stores and all the computation they depend on are access
[Diagram: address calculation (a+i, b+i, c+i), loads of a[i] and b[i], the × and +2 compute nodes, and the store to c[i]]
[Built up stepwise over three more slides: the address calculations, loads, and store stay in the access-PDG, while the × and +2 nodes are split out as the execute subregion]
Vectorization
40
• Similar to SIMD techniques, loops must have independent iterations and no store/load aliasing
• Memory access: no gather/scatter
• Perform loop control: modify the trip count / peel a scalar loop
[Diagram: the a[i] × b[i], +2 → c[i] dataflow]
Vectorization
After vectorization the accesses widen to a[i:i+3], b[i:i+3], c[i:i+3], and data is pipelined through DySER
Scheduling
• Map the execute subregion to DySER:
– Sort nodes in dataflow order
– Greedily place each node to minimize the total routes
[Diagram: the × / +2 dataflow with inputs in1, in2 and output out, placed onto the switch/FU array]
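The greedy placement described above might look like this in C++. This is a simplified sketch under stated assumptions — a fully free 2-D grid of FUs and Manhattan distance as the routing cost, ignoring switch contention:

```cpp
#include <vector>
#include <utility>
#include <cstdlib>
#include <cstddef>
#include <climits>

struct Op { std::vector<int> preds; };  // nodes already visited in dataflow order

// Place each node, in dataflow order, at the free grid cell minimizing
// total Manhattan route length to its already-placed predecessors.
std::vector<std::pair<int,int>>
greedy_place(const std::vector<Op>& ops, int rows, int cols) {
    std::vector<std::pair<int,int>> pos(ops.size(), {-1, -1});
    std::vector<std::vector<bool>> used(rows, std::vector<bool>(cols, false));
    for (std::size_t n = 0; n < ops.size(); ++n) {
        int best = INT_MAX;
        std::pair<int,int> at{0, 0};
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c) {
                if (used[r][c]) continue;
                int cost = 0;                      // route length to preds
                for (int p : ops[n].preds)
                    cost += std::abs(pos[p].first - r)
                          + std::abs(pos[p].second - c);
                if (cost < best) { best = cost; at = {r, c}; }
            }
        pos[n] = at;
        used[at.first][at.second] = true;
    }
    return pos;
}
```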
[The placement proceeds step by step over the following slides: the × node, then +2 and out, are mapped one at a time onto the array]
Case Study: Loop Dependence

//Needleman Wunsch
int a[][], b[][]; //initialize
for (int i = 1; i < NCOLS; ++i) {
  for (int j = 1; j < NROWS; ++j) {
    a[i][j] = max(a[i-1][j-1] + b[i][j],
                  a[i-1][j],
                  a[i][j-1]);
  }
}

Outer iterations are dependent, too
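For reference, a runnable C++ version of the recurrence (container types filled in for illustration) makes the loop-carried dependence explicit — cell (i,j) needs (i-1,j-1), (i-1,j), and (i,j-1):

```cpp
#include <algorithm>
#include <vector>
#include <cstddef>

// The slide's Needleman-Wunsch recurrence. Each cell depends on the
// previous row and the previous cell of the same row; cells along an
// anti-diagonal are independent of one another.
void needleman_wunsch(std::vector<std::vector<int>>& a,
                      const std::vector<std::vector<int>>& b) {
    for (std::size_t i = 1; i < a.size(); ++i)
        for (std::size_t j = 1; j < a[i].size(); ++j)
            a[i][j] = std::max({a[i-1][j-1] + b[i][j],
                                a[i-1][j],
                                a[i][j-1]});
}
```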
[Diagram: the dependence chain through array a[] — each +/max/max cell uses the result of the previous iteration (a[i][j-1]) along with a[i-1][j-1] and a[i-1][j] — yet the region is vectorizable]
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
49
Evaluation Methodology
Simulation framework: gem5 + DySERsim for performance; McPAT for energy
Compiler implementation: leverages the LLVM compilation framework; constructs the AEPDG from LLVM-IR; generates binaries for x86 and SPARC
Benchmarks: throughput workloads (Intel TPT kernels, Parboil benchmark suite); general purpose workloads (SPEC 2006, PARSEC); database (operators, primitives, and a query)
50
Evaluation
What are the performance/energy benefits? DLP workloads; general purpose or irregular workloads
How effective is the compiler?
How effective is it on database query processing? Both DLP and irregular code in the same application
51
DySER vs. Superscalar: DLP
[Chart: speedup (up to ~10×) and energy reduction (%) over the superscalar baseline for CONV, MERGE, NBODY, RADAR, TrSRCH, VR, CUTCP, FFT, KMEANS, LBM, MMM, RI-Q, SPMV, STENCIL, TPACF, NNW, NEEDLE, GM. Annotations: control flow in memory access; multiple configurations, so configuration cost starts to dominate; indirect memory access and loop-carried dependences]
DySER performs on average 3.4x better than baseline with 53% reduction in energy consumption
52
DySER vs. Superscalar: General Purpose
[Chart: speedup (0–20%) and energy reduction (0–30%) for ASTAR, BZIP2, H264, HMMER, LIBQUANTUM, MCF, BLACKSCHOLES, FLUIDANIMATE, FREQMINE, SWAPTIONS, STREAMCLUSTER, GM]
DySER provides 8% mean speedup with 11% reduction in energy consumption
Data dependent branches are mapped into DySER, which leads to fewer pipeline flushes
Exploits the available DLP, but control dependent stores prevent larger gains
53
Where does the efficiency come from?
[Chart: effective IPC — DySER IPC stacked on core IPC vs. baseline IPC — across the throughput kernels]
DySER emulates a wider issue processor than the baseline processor
54
DySER vs. Superscalar: Summary
[Chart: speedup and energy reduction (%) by suite — TPT kernels, Parboil, SPECINT, PARSEC]
On DLP workloads, DySER provides significant improvements
On irregular workloads, DySER provides modest improvements
55
Performance: SSE/AVX Vs. DySER
[Chart: speedup over SSE for SSE, AVX, and DySER across the throughput kernels; one kernel reaches 13×]
DySER bottlenecked by FDIV/FSQRT units
When DLP is readily available, both SIMD and DySER perform well
With control intensive code, DySER performs better
DySER performs on average 1.8x better than SSE/AVX
Why is DySER more efficient than SIMD?
SIMD vectorizes either inside the loop (superword-level parallelism) or across loop iterations (“do across”)
DySER can simultaneously vectorize both
[Diagram: SIMD–SLP vs. SIMD “do across” vs. DySER vectorization shapes]
57
Programmer Optimized vs. Compiler Optimized
[Chart: compiler-generated code relative to programmer-optimized across the throughput kernels; the gaps come from outer loop transformations, a different strategy for reductions, and constant table lookups]
The compiler-generated code’s slowdown is only 30%
58
Why Database?
Energy management is emerging as a primary goal
DySER is an energy efficient in-core accelerator that dynamically specializes frequently executed code
Query processing with database kernels/primitives
59
Simplified TPC-H Query 1

SELECT sum(quantity),
       sum(price * (1-disc)),
       sum(price * (1-disc) * (1+tax)),
       count(*)
FROM lineitem
WHERE ship_date <= XXXX
GROUP BY returnflag, linestatus
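A scalar C++ rendering of this query (with illustrative types — the real TPC-H schema uses decimals and dates) shows the projection, scan, group-by, and aggregation steps that the following slides pick apart:

```cpp
#include <map>
#include <vector>
#include <utility>

// Hypothetical row layout for the simplified query.
struct LineItem {
    double quantity, price, disc, tax;
    int ship_date;
    char returnflag, linestatus;
};
struct Aggregates { double sum_qty = 0, sum_disc = 0, sum_charge = 0; long count = 0; };

std::map<std::pair<char,char>, Aggregates>
query1(const std::vector<LineItem>& table, int cutoff) {
    std::map<std::pair<char,char>, Aggregates> groups;
    for (const auto& r : table) {
        if (r.ship_date > cutoff) continue;               // WHERE ship_date <= XXXX
        auto& g = groups[{r.returnflag, r.linestatus}];   // GROUP BY
        g.sum_qty    += r.quantity;
        g.sum_disc   += r.price * (1 - r.disc);
        g.sum_charge += r.price * (1 - r.disc) * (1 + r.tax);
        g.count      += 1;
    }
    return groups;
}
```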
60
The query decomposes into: projection (highly data parallel), SCAN (highly data parallel), HASH (data parallel with control), and AGGR (limited DLP)
Query Processing Implementations
JIT: the whole query is processed in a single loop; no intermediate results materialized
Vectorized: the query is processed in blocks with columnar data access; intermediate results are materialized; better for SIMD and exploits cache locality
Hybrid: partition the query to utilize the available DLP without materializing many intermediate results
65
Result: TPC-H Query 1
[Chart: speedup of JIT, vectorized, and hybrid implementations on scalar, SIMD, and DySER]
Since little DLP is available, SIMD performs poorly; DySER achieves >2.5× speedup by exploiting pipeline parallelism
Hardware/software codesign improves query processing significantly
66
How about design complexity? We (five graduate students) implemented a prototype of DySER integrated with OpenSPARC
The prototype was mapped onto a Xilinx Virtex 5 FPGA board; it boots unmodified Ubuntu 7.10 Linux, and DySER is not on the critical path!
Design, Integration, and Implementation of the DySER Hardware Accelerator into OpenSPARC, in HPCA 2012
DySER is indeed a non-intrusive design and is easy to integrate into a commercial processor
67
Conclusion
We must rethink and co-design the architecture, micro-architecture, and compilers, making energy a primary constraint; incremental evolution of historical accelerators has produced diminishing returns
Compiler assisted hardware specialization provides energy efficiency without loss of generality, and with low design complexity
68
Publications
[GNS PACT 2013] Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In PACT 2013.
[GHNCSSK IEEE Micro 2012] DySER: Unifying Functionality and Parallelism Specialization for Energy Efficient Computing. IEEE Micro, Sep/Oct 2012.
[BCFGHMNS HotChips 2012] Design, Integration and Implementation of DySER Hardware Accelerator into OpenSPARC. In HotChips 2012.
[BCFHGNS HPCA 2012] Design, Integration and Implementation of DySER Hardware Accelerator into OpenSPARC. In HPCA 2012.
[NSHGDS ISCA 2011] Sampling + DMR: Practical and Low-overhead Permanent Fault Detection. In ISCA 2011.
[GHS HPCA 2011] Dynamically Specialized Datapaths for Energy Efficient Computing. In HPCA 2011.
[GDSVM Micro 2008] Toward A Multicore Architecture for Real-time Ray-tracing. In Micro 2008.
69
Acknowledgements
Prof. Karu Sankaralingam
Marc de Kruijf
Tony Nowatzki
Lena Olson
DySER Team: Chen-Han, Tony, Chris, Ryan, Zach, Jesse
Questions?
71
Backup Slides
DySER Datapath
72
• Ready (R) – for flow control (forward)
• Credit (C) – for flow control (backward)
• Valid (V) – to support control flow
[Diagram: a data word with C, V, and R bits]
Processor Integration: in-order
DySER interface: FIFO
73
[Diagram: DySER attached to the in-order pipeline (Fetch, Decode, Execute, Memory, WriteBack) through FIFOs]
Out-of-Order Integration
Out-of-order core integration:
DySER itself maintains no architectural state
Buffers keep the state for speculative execution
74
Small loops: leverage loop properties — simply unroll the loop further, “cloning” the region (also uses the flexible I/O)
[Diagram: before/after — one execute region vs. two cloned copies sharing the input/output FIFOs]
Large loops: subgraph matching — find identical computations and split them out
Region splitting — configure multiple regions and quickly switch between them
[Diagram: a large region handled by subgraph matching and region virtualization]
77
Results: Configuration Cost
Programs follow the 90/10 rule
[Chart: percentage of code regions contributing 90% of running time for blackscholes, canneal, fluidanimate, streamcluster, bzip2, mcf, h264ref, soplex, sphinx3, and the mean]
78
Energy: SSE/AVX Vs. DySER
[Chart: % energy reduction of SSE vs. DySER across the throughput kernels]
79
Related Work: Architecture
Reconfigurable systems: FPGAs — high software cost
Coarse grain reconfigurable systems:
Beret — uses a pre-designed set of SEBs (Micro 2011)
C-Cores — uses a set of conservation cores to accelerate functions (ASPLOS 2010)
VEAL and CCA — loop accelerators (ISCA 2008)
Other reconfigurable coprocessor approaches: Garp, Tartan, Chimera, etc.
80
Related Work: Beret
An energy efficient coprocessor
No internal control-flow
Uses a set of SEBs (subgraph execution blocks)
81
Related work: Conservation Cores
A set of specialized units that accelerates whole functions.
Slow, no pipelining support
82
Other Publications
Reliability
Sampling + DMR: Practical and Low-overhead Permanent Fault Detection, In ISCA 2011.
Specialized Architecture
Toward A Multicore Architecture for Real-time Ray-tracing, In Micro 2008
83
Database Backup Slides
84
Evaluation Methodology
Implemented optimized versions — Baseline: C (no special operations); SSE: manually optimized with compiler intrinsics; DySER: manually optimized with DySER instructions; AutoDySER: automatically DySERized by the compiler
Evaluated using a gem5 based simulator (x86, out-of-order CPU model)
85
SCAN
Scans a table with equality predicate
High data level parallelism
kernel:
//inputs: in_mask: bitvector, col, key
//output: out_mask: bitvector
for (i = 0; i < LEN; i += SZ):
    for (j = 0; j < SZ; ++j):
        out |= (col[i*SZ+j] == key) << j
    out_mask[i] = in_mask[i] & out
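A runnable C++ version of this kernel (one detail differs: the slide’s pseudocode never clears `out` between blocks, while this version does):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

constexpr int SZ = 8;   // predicate bits packed per mask byte

// Compare SZ column values against the key, pack the results into a
// bitvector, and mask with the incoming selection vector.
std::vector<uint8_t> scan_eq(const std::vector<int>& col, int key,
                             const std::vector<uint8_t>& in_mask) {
    std::vector<uint8_t> out_mask(in_mask.size(), 0);
    for (std::size_t i = 0; i < in_mask.size(); ++i) {
        uint8_t out = 0;
        for (int j = 0; j < SZ; ++j)
            out |= (col[i * SZ + j] == key) << j;   // predicate bit j
        out_mask[i] = in_mask[i] & out;
    }
    return out_mask;
}
```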
86
SCAN: Results
If DLP is available, both DySER and SIMD perform well
87
Aggregation
Kernel:
for (i = 0; i < LEN; i++):
    key = k[i]
    A[key] += V[i]
Indirect memory access
Represents the worst case for DySER
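As runnable C++, the kernel is an indirect and possibly aliasing update — `A[k[i]]` may repeat across iterations — which is what makes it hard to unroll:

```cpp
#include <vector>
#include <cstddef>

// Grouped aggregation: scatter-add through an index column.
// k[i] == k[i+1] is legal, so consecutive iterations may alias.
void aggregate(std::vector<double>& A,
               const std::vector<int>& k,
               const std::vector<double>& V) {
    for (std::size_t i = 0; i < k.size(); ++i)
        A[k[i]] += V[i];    // indirect store; cannot be blindly unrolled
}
```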
88
Why is it hard for DySER?
Mostly address calculation; the computation is just one instruction
Aliasing prevents loop unrolling
[Diagram: the kernel’s dataflow — loads of k[i], A[key], and V[i], two adds, and the store back to A[key]]
91
Why is it hard for DySER? Between iterations i and i+1 there is a may-dependence between the store to A[key] and the next iteration’s load
[Diagram: two copies of the dataflow connected by a may-dependence edge]
92
Solution: alias checking in DySER — compare the two keys (==?) inside the datapath
[Diagram: the unrolled dataflow with an equality check between the two addresses]
93
Aggregation Results
With an out-of-order processor, DySER provides speedup; with an in-order processor, it performs poorly.
94
Database Kernels
DB Kernels with data-level parallelism SCAN SORT
DB Kernels with DLP and control SCAN on RLE HASH STRCMP (Variable length)
Data-level parallelism not readily available Aggregation
95
Results
X86 Inorder
96
Overview
Database Kernels Characterization and Evaluation
Codesigning DySER/DB
Conclusion
97
Codesigning DySER/DB
DySER’s effectiveness drops when memory operations dominate — integrate loads and stores with DySER: the Memory Access Dataflow (MAD) architecture
Is vectorized query processing a problem for DySER? The compute/memory ratio is low; JIT query processing may let DySER exploit pipeline parallelism better
100
A Simple Query
SELECT price * (1-disc),
       price * (1-disc) * (1+tax)
FROM lineitem
101
Query 1 Implementation

JIT query processing:
Out_1 = price * (1-disc)
Out_2 = Out_1 * (1+tax)

Vectorized query processing (inputs → outputs):
tmp_out1 = (1-disc)
tmp_out2 = (1+tax)
out1 = price * tmp_out1
out2 = out1 * tmp_out2
102
Result: Query 1
When fully data-parallel, both SIMD and DySER perform well
103
Slightly Complex Query (TPC-H Q1)
SELECT sum(quantity),
       sum(price * (1-disc)),
       sum(price * (1-disc) * (1+tax)),
       count(*)
FROM lineitem
WHERE ship_date <= XXXX
GROUP BY returnflag, linestatus
104
The query decomposes into: projection (highly data parallel), SCAN (highly data parallel), HASH (data parallel with control), and AGGR (no DLP)
108
Implementations
JIT: the whole query is processed in a single loop; no intermediate results materialized
Vectorized: the query is processed in blocks with columnar data access; intermediate results are materialized; better for SIMD and exploits cache locality
Hybrid: partition the query to utilize the available DLP without materializing many intermediate results
109
Result: TPC-H Query 1
If no DLP is available, SIMD performs poorly; DySER achieves >2.5× speedup by exploiting pipeline parallelism
110
Database Conclusion
DySER exploits both pipeline parallelism and DLP: when DLP is present, DySER provides >2× speedup, and so does SIMD; DySER can provide speedup even when aliasing or control is present
For kernels with a low computation/memory ratio, integrating LD/ST units with DySER may help, but explicit alias checks and high bandwidth are required
Combining multiple database kernels to exploit pipeline parallelism in DySER improves performance, but requires careful looping strategies to utilize DySER well
111
Support for Irregular Workloads
112
Outline
Introduction
Architecture Changes
Compiler Changes
Evaluation
DySER and Irregular Code: 462.libquantum

for (i = 0; i < reg->size; i++) {
  if (reg->node[i].state & ((MAX_UNSIGNED)1 << control1))
    if (reg->node[i].state & ((MAX_UNSIGNED)1 << control2))
      reg->node[i].state ^= ((MAX_UNSIGNED)1 << target);
}
Loop Invariants
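A runnable C++ rendering of this loop (the register and node types are stand-ins for libquantum’s, which are not shown on the slide):

```cpp
#include <cstdint>
#include <vector>

using MAX_UNSIGNED = uint64_t;   // stand-in for libquantum's typedef
struct QuantumNode { MAX_UNSIGNED state; };
struct QuantumReg  { int size; std::vector<QuantumNode> node; };

// Flip the target bit of every amplitude whose two control bits are set.
// control1/control2/target are loop invariants, which is what the
// dysendinv instruction later exploits.
void toffoli(QuantumReg* reg, int control1, int control2, int target) {
    for (int i = 0; i < reg->size; i++) {
        if (reg->node[i].state & ((MAX_UNSIGNED)1 << control1))
            if (reg->node[i].state & ((MAX_UNSIGNED)1 << control2))
                reg->node[i].state ^= ((MAX_UNSIGNED)1 << target);
    }
}
```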
113
DySER and Irregular Code: 462.libquantum

Scalar code:
...
Loop: load reg->node[i], %r1
      andcc %r1, Ctrl1
      bz NextIter
      andcc %r1, Ctrl2
      bz NextIter
      xor %r1, Tgt, %r2
      st %r2, reg->node[i]
NextIter:
      ...
      b Loop

DySER code:
...
Loop: Dload reg->node[i], p0
      Dsend Ctrl1, p1
      Dsend Ctrl2, p2
      Dsend Tgt, p3
      Drecv p4, valid
      cmp valid, 0
      bz NoStore
      Dstore p5, reg->node[i]
      b Merge
NoStore:
      Drecv p5, dummy
Merge:
      ...
      b Loop

Issues: invariant sends; a branch on DRECV is expensive; receives need to drain even invalid outputs
DySER and Irregular Code: 429.MCF

for ( ; arc < stop_arcs; arc += nr_group) {
  if (arc->ident > BASIC) {
    red_cost = arc->cost - arc->tail->potential + arc->head->potential;
    if ((red_cost < 0 && arc->ident == AT_LOWER) ||
        (red_cost > 0 && arc->ident == AT_UPPER)) {
      basket_size++;
      perm[basket_size]->a = arc;
      perm[basket_size]->cost = red_cost;
      perm[basket_size]->abs_cost = ABS(red_cost);
    }
  }
}

Issue: control dependent memory (the access code is intertwined with control)
DySER ISA for Irregular Workloads
DySER send-invariant instruction: dysendinv <reg>, <port>
DySER invocation-start instruction: dystart
DySER branch instructions: dybz <port>, Label; dybnz <port>, Label
118
DySER Output Interface
[Diagram: invalid data arriving at the output interface is marked as aborted and its value discarded]
Outline
Issues with Irregular code
Simulator Fixes
Architecture Changes
Compiler Changes
Evaluation and Results
121
Compiler Changes
Slicing: do not back-slice through control edges (the DySER branch instruction handles them), which offloads more instructions to DySER
Code generator changes: emit the new DySER instructions; no need to insert dummy receive instructions
122
Outline
Issues with Irregular code
Simulator Fixes
Architecture Changes
Compiler Changes
Evaluation and Results
123
DySER and Irregular code462.libquantum
Scalar code:
...
Loop: load reg->node[i], %r1
      andcc %r1, Ctrl1
      bz NextIter
      andcc %r1, Ctrl2
      bz NextIter
      xor %r1, Tgt, %r2
      st %r2, reg->node[i]
NextIter:
      ...
      b Loop

DySER code:
...
      Dsendinv Ctrl1, p1
      Dsendinv Ctrl2, p2
      Dsendinv Tgt, p3
Loop: Dstart
      Dload reg->node[i], p0
      Dbrz p4, NextIter
      Dstore p5, reg->node[i]
NextIter:
      ...
      b Loop
124
DySER and Irregular Code: 429.MCF
[The MCF loop again, now split into access code and execute code]
Performance Results
[Chart: speedup over the respective scalar version (≈0.85×–1.2×) for bzip2, hmmer, libquantum, mcf, h264ref on in-order, 2-wide OOO, and 4-wide OOO cores]
126
Energy Results
[Chart: energy reduction (0–25%) for the same benchmarks on in-order, 2-wide OOO, and 4-wide OOO cores]
127
128
Future Directions
129
Future Research Directions: How can we make legacy code energy efficient?
Use JIT compilation to target accelerators dynamically — from a specialized IR if source code is available, otherwise from the binary itself
Use binary rewriters to target accelerators statically
Challenges: analysis to identify acceleratable instruction sequences; lightweight analysis for JIT; static analysis of compiled binaries; specialized IR design
130
Future Research Directions: Energy efficient memory hierarchy (EEMH)
Moving data burns most of the energy; filtering data or performing operations in the hierarchy itself will help reduce energy
Challenges — design: how to perform computation efficiently in memory? Programming model: how to program the EEMH? Compiler: what compiler algorithms or transformations are needed for the EEMH?
132
DySER vs. Superscalar: Irregular
[Chart: speedup (0–20%) and energy reduction (%) for ASTAR, GCC, H264, LIBQUANTUM, OMNETPP, SJENG, FLUIDANIMATE, SWAPTIONS]
133
Opportunities in Database Traditional Query Processing:
Vectorized Query Processing:
Traditional: one record flows through SCAN → PROJECT → HASH → AGGREGATE, producing output for one record
Vectorized: multiple records flow through the same operators, producing output for multiple records
134
Database Kernels
DB Kernels with data-level parallelism SCAN SORT
DB Kernels with DLP and control SCAN on RLE HASH STRCMP
Data-level parallelism not readily available Aggregation
135
Database Kernels: Performance
[Chart: speedup (0–7×) of Scalar, SSE, DySER, and AutoDySER on SCAN, SCAN+RLE, SORT, HASH, STRCMP, AGGR, GM. SCAN and SORT are highly data parallel; DySER provides speedup even on data intensive code]