1
Energy Efficient Computing through Compiler Assisted Dynamic Specialization
Venkatraman Govindaraju
Advisor: Karthikeyan Sankaralingam
(Defense: 7/29/2014)
2
Why energy efficiency?
[Graph: transistors (in 100K), power (W), performance (GOPS), and efficiency (GOPS/W), 1985–2010]
Moore’s Law is still valid
Power is limited by heat
Performance stagnates because of diminishing returns
Simplified Graph from “The Free Lunch Is Over”. Herb Sutter. In DDJ, March 2005
We must improve energy efficiency to scale performance
FabScalar OpenSPARC
Where is energy consumed?
Actual execution consumes only a fraction of energy
3
Reduce overhead energy to improve overall energy efficiency
Data is from “Power balanced pipelines” Sartori et al. in HPCA 2012
4
How to get efficiency?
Use accelerators or specialization
[Chart: efficiency vs. generality/compiler effectiveness, placing general purpose processors (GPP), SIMD, and ASICs]
Flexible as a GPP, but with ASIC efficiency?
5
DySER: Compiler Assisted Hardware Specialization
Efficiency: use specialized hardware for hot regions
Generality: reconfigurable at run-time, using encodings generated at compile-time
Design complexity: decoupled access/execute; the original core runs uncommon tasks
6
Evolution of DySER
[Chart: efficiency vs. compiler effectiveness, tracing general purpose processor → SIMD (SSE) → DySER → DySER + DLP → DySER + DLP + Slicer → ASIC]
Dynamically specialized datapath, DSL programming [HPCA 2011]
Exploits DLP and vectorization for high efficiency, DSL programming [IEEE Micro 2012]
AEPDG, a new IR to model DySER; auto-compiles directly from C/C++ to DySER [PACT 2013]
7
What’s New? (Preliminary Exam 8/12 → Defense 7/14)
Architecture: Basic DySER ISA, Vector DySER ISA → Vector DySER ISA, ISA for irregular workloads
Compiler: preliminary design, partial implementation → complete design, source code released
Evaluation: high-level pipeline models (SPEC INT, PARSEC) → accurate simulator models (SPEC INT, PARSEC, throughput kernels, Parboil, database)
Publications: Architecture (HPCA 2011), Prototype (HPCA 2012) → DySER+DLP (IEEE Micro 2012), Compiler (PACT 2013), Integration (HotChips 2012), Modeling (Micro 2014, in submission)
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
DySER Overview
9
DySER
• Circuit-switched array of functional units
• Integrated into the processor pipeline
• Dynamically creates specialized datapaths
[Diagram: five-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack) with I$, D$, register file, and DySER attached alongside the execute units]
DySER Datapath
10
DySER Configuration
Uses the same network for configuration bits
Configure once – reuse many times
DySER Execution Model: Decoupled Access/Execute
Memory access instructions execute in the processor pipeline: address calculation, loads and stores, DySER configuration, sending data to DySER, receiving data from DySER, and loop control
Computation executes in DySER
[Diagram: the processor instruction stream with its computation (×, −, +) offloaded into a DySER configuration; the loop back-edge (JMP LOOP) stays in the processor]
Execution Example
13
[Diagram: DySER array of functional units (FU) and switches (S), input/output FIFOs, input ports IP0–IP3, output ports OP0–OP3, and the configuration path]
DySER program:

//Vector Dot Product
DyINIT(0xABCD)
DyINIT(0xEF00)
...
SUM = [0,0];
for (int i = 0; i < LEN; i += 2) {
  DySend_Vec(SUM, IP0);
  DyLoad_Vec(a[i:i+1], IP1);
  DyLoad_Vec(b[i:i+1], IP2);
  DyRecv_Vec(OP2, SUM);
}
sum = accum(SUM); //(last iteration here)
return sum;
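The DyINIT/DySend/DyLoad intrinsics above are slide pseudocode. As a plain C++ sketch (function and variable names here are illustrative, not part of the DySER ISA), the computation the configured datapath performs is:

```cpp
#include <cstddef>

// Plain C++ equivalent of the dot product the DySER program computes.
// Two partial sums mirror the two-wide vector interface.
float dot_product(const float* a, const float* b, std::size_t len) {
    float sum[2] = {0.0f, 0.0f};           // SUM = [0,0] in the slide code
    std::size_t i = 0;
    for (; i + 1 < len; i += 2) {          // two lanes per invocation
        sum[0] += a[i]     * b[i];
        sum[1] += a[i + 1] * b[i + 1];
    }
    float total = sum[0] + sum[1];         // sum = accum(SUM)
    for (; i < len; ++i)                   // peeled last iteration
        total += a[i] * b[i];
    return total;
}
```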
Execution Example
14
[Diagram: the dot-product configuration active on the array — a × and a + unit mapped, fed from the input FIFOs; same DySER program as above]
15
Why does it work?
Applications execute in phases
Applications follow the 90-10 rule: 10% of code regions contribute 90% of run time
Specializing such code regions amortizes the overheads
Where does performance come from?
Removing instructions from the main pipeline: less use of the instruction queue, ROB, and register file; effectively a larger instruction window
Decoupled execution: concurrency between the main processor and DySER; many FUs -> high potential ILP
Benefits of vectorization: fewer memory access instructions; explicit pipelining of DySER
16
17
Energy Savings?
Eliminates per-instruction overheads: no fetch, decode, etc.; no expensive register file reads
High performance itself leads to energy savings; no additional power-hungry structures
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
19
Compiler Intermediate Representation
Makes it easier to optimize for the target architecture
A suitable IR should: model the architecture, accurately if possible; capture the dependencies between operations; make code generation for the architecture easy
DySER Architecture: Configurable Datapath
20
Configure switches and functional units to create different datapaths
Can specialize the datapath for ILP or for DLP
Allows acceleration of a variety of computation patterns
[Diagrams: three example configurations of the switch/FU array — multiply-accumulate, sum of absolute differences, and 3×3 convolution]
21
Compiler IR for DySER: Modeling Configurable Datapath
Graph based: nodes represent the operations/instructions; edges represent dependences between the operations
Easier to map computation to DySER
for (i = 0; i < N; ++i) C[i] += A[i] * B[i]
[Diagram: the loop body as a dataflow graph — two loads feeding ×, then +2 and a store — and its mapping onto the array via input ports in1, in2 and output port out]
DySER Architecture: Control Flow Mapping
22
S S
S
S S
S
S S
>S
+S
S
-S
S S S
φS
Predication: predicates the output; a metabit in the datapath propagates the validity of the data
“Select” functional unit (PHI functions): selects the valid input and forwards it as its output
Native control flow mapping allows accelerating code with arbitrary control flow
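The predication scheme above can be sketched in plain C++. The validity-metabit semantics here are a simplified assumption for illustration, not the hardware definition:

```cpp
// Each value carries a validity metabit; a PHI node forwards whichever
// input arrives valid (exactly one side of an if/else is valid).
struct PredValue { float data; bool valid; };

PredValue phi(PredValue in0, PredValue in1) {
    return in0.valid ? in0 : in1;
}

// if (b < 0) a = b + 5; else a = b - 5;  mapped to predicated dataflow:
float select_example(float b) {
    bool pred = (b < 0);
    PredValue t = { b + 5, pred };     // "then" side valid iff pred
    PredValue f = { b - 5, !pred };    // "else" side valid iff !pred
    return phi(t, f).data;
}
```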
23
Compiler IR for DySER: Modeling Control Flow mapping
Special edges to represent control dependence
Special node to model PHI instruction
for (i = 0; i < N; ++i):
    if b[i] < 0:
        a = b[i] + 5
    else:
        a = b[i] - 5
    b[i] = a
[Diagram: the loop’s PDG with +, −, <, φ, load, store, and address (b+i) nodes; control dependence edges feed the φ node]
DySER Architecture: Decoupled Access/Execute Execution
[Diagram: the DySER array with input/output FIFOs, input ports IP0–IP3, and output ports OP0–OP3]
Processor sends data to DySER through its input FIFOs (input ports)
DySER computes in data flow fashion
Processor receives data from DySER through its output FIFOs (output ports)
Allows DySER to consume data in a different order than it is stored
25
Compiler IR for DySER: Modeling Decoupled Access/Execute Execution
Explicitly partitioned into Access and Execute PDG
[Diagram: the same loop partitioned into an access-PDG (loads, stores, address computation b+i) and an execute-PDG (+, −, <, φ), exchanging the values b[i+0], b[i+1]]
DySER Architecture: Flexible Vector Interface
struct vec { float x, y, z; float q; };
vec A[], B[];
float *a = A, *b = B;
float dot[];
for (int i = 0; i < LEN; i += 1) {
  dot[i] = A[i].x*B[i].x + A[i].y*B[i].y + A[i].z*B[i].z;
}
26
[Diagram: the dot-product dataflow — three multipliers and two adders over a[i], a[i+1], a[i+2] and the b[] inputs — mapped onto the array]
DySER Architecture: Flexible Vector Interface
[Diagram: across iterations 1 and 2, the three multipliers consume a[0],a[4]; a[1],a[5]; a[2],a[6] — how do we get this access pattern? Ports shown only for a[]]
DySER Architecture: Flexible Vector Interface
A flexible mechanism to map contiguous inputs to arbitrary DySER I/Os:
Add a “vector port” before the FIFOs
Add a “vector map” that tells how data should be transferred
A state machine processes the data as it arrives
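As a rough C++ sketch of the vector-map idea (the map encoding below is an illustrative data structure, not the hardware format): each entry routes one contiguous word to a DySER input port, or skips it:

```cpp
#include <vector>
#include <cstddef>

// Entry k of port_map says which input port receives the k-th contiguous
// word, or -1 if that word is skipped (the 'x' slots on the slide).
std::vector<std::vector<float>>
apply_port_map(const float* data, std::size_t n,
               const std::vector<int>& port_map, int num_ports) {
    std::vector<std::vector<float>> ports(num_ports);
    for (std::size_t k = 0; k < n; ++k) {
        int p = port_map[k % port_map.size()];   // map repeats per group
        if (p >= 0) ports[p].push_back(data[k]); // strided delivery
    }
    return ports;
}
```

For the struct {x, y, z, q} example, the map {0, 1, 2, -1} delivers x, y, z to ports P0–P2 and skips the padding word q.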
[Diagram: a vector port map routing contiguous words 0–7 to ports P0, P1, P2 and skipping the unused word of each group, feeding the input FIFO]
29
DySER Architecture: Flexible Vector Interface
[Diagram: the vector port mapping routes words 0–7 of a memory/vector register through IP0–IP3 into the array]
“Vector port mapping” allows accelerating code regions with different memory access patterns (e.g., strided)
Compiler IR for DySER: Modeling Flexible Vector Interface
30
[Diagram: original AEPDG → unrolled AEPDG → vector map generation (load/store coalescing); the vector port map for a[] is filled in, port by port, as loads are coalesced]
Each edge on the interface knows its order
Compiler IR for DySER: Modeling Flexible Vector Interface
31
[Diagram: the same flow, with the unrolled AEPDG coalesced into vectorized accesses a[i:i+5] and out[i:i+1]]
32
Compiler IR: Access Execute Program Dependence Graph (AEPDG)
A variant of the PDG
Nodes represent operations
Edges represent both data and control dependences
Explicitly partitioned into access-PDG and execute-PDG subgraphs
Edges between the access- and execute-PDGs are augmented with temporal information
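The access/execute partitioning rule can be sketched as a small graph traversal. This is a toy model, not the Slicer implementation: seed with memory operations, then pull in the nodes they depend on for addressing:

```cpp
#include <vector>
#include <string>

// Toy AEPDG node: an opcode plus the nodes it uses for address computation.
struct Node {
    std::string op;              // "ld", "st", "add", "mul", ...
    std::vector<int> addr_deps;  // nodes feeding an address
};

// Mark the access-PDG: loads/stores and their backward address slices.
// Everything left unmarked belongs to the execute-PDG.
std::vector<bool> mark_access(const std::vector<Node>& g) {
    std::vector<bool> access(g.size(), false);
    std::vector<int> work;
    for (int i = 0; i < (int)g.size(); ++i)
        if (g[i].op == "ld" || g[i].op == "st") {
            access[i] = true;
            work.push_back(i);
        }
    while (!work.empty()) {                  // follow address dependences
        int n = work.back(); work.pop_back();
        for (int d : g[n].addr_deps)
            if (!access[d]) { access[d] = true; work.push_back(d); }
    }
    return access;
}
```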
[Diagram: the example AEPDG — access nodes (load, store, address b+i) and execute nodes (+, −, <, φ)]
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
Compilation tasks:
Identify code regions/loops to specialize
Construct the AEPDG (access PDG, execute PDG)
Perform vectorization/optimizations
Schedule: execute PDG to DySER, access PDG to the core
[Flowchart: Application → Region Identification → Construct AEPDG → Vectorization/Optimization → Scheduling → Access PDG on the core, Execute PDG on DySER]
Region identification: identify code regions to specialize
Path profiling; utilize loops
Need single-entry/single-exit regions
Specialization region
Construct AEPDG
Build the program dependence graph
Separate memory access from computation: loads/stores and all the computation they depend on are access
[Diagram: address calculation (a+i, b+i, c+i), loads of a[i] and b[i], the × and +2 compute nodes, and the store to c[i]]
[Built up stepwise over three more slides: the address calculations, loads, and store stay in the access-PDG, while the × and +2 nodes are split out as the execute subregion]
Vectorization
40
• Similar to SIMD techniques, loops must have independent iterations and no store/load aliasing
• Memory access: no gather/scatter
• Perform loop control: modify the trip count / peel a scalar loop
[Diagram: the a[i] × b[i], +2 → c[i] dataflow]
Vectorization
After vectorization the accesses widen to a[i:i+3], b[i:i+3], c[i:i+3], and data is pipelined through DySER
Scheduling
• Map the execute subregion to DySER:
– Sort nodes in dataflow order
– Greedily place each node to minimize the total routes
[Diagram: the × / +2 dataflow with inputs in1, in2 and output out, placed onto the switch/FU array]
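The greedy placement described above might look like this in C++. This is a simplified sketch under stated assumptions — a fully free 2-D grid of FUs and Manhattan distance as the routing cost, ignoring switch contention:

```cpp
#include <vector>
#include <utility>
#include <cstdlib>
#include <cstddef>
#include <climits>

struct Op { std::vector<int> preds; };  // nodes already visited in dataflow order

// Place each node, in dataflow order, at the free grid cell minimizing
// total Manhattan route length to its already-placed predecessors.
std::vector<std::pair<int,int>>
greedy_place(const std::vector<Op>& ops, int rows, int cols) {
    std::vector<std::pair<int,int>> pos(ops.size(), {-1, -1});
    std::vector<std::vector<bool>> used(rows, std::vector<bool>(cols, false));
    for (std::size_t n = 0; n < ops.size(); ++n) {
        int best = INT_MAX;
        std::pair<int,int> at{0, 0};
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c) {
                if (used[r][c]) continue;
                int cost = 0;                      // route length to preds
                for (int p : ops[n].preds)
                    cost += std::abs(pos[p].first - r)
                          + std::abs(pos[p].second - c);
                if (cost < best) { best = cost; at = {r, c}; }
            }
        pos[n] = at;
        used[at.first][at.second] = true;
    }
    return pos;
}
```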
[The placement proceeds step by step over the following slides: the × node, then +2 and out, are mapped one at a time onto the array]
Case Study: Loop Dependence

//Needleman Wunsch
int a[][], b[][]; //initialize
for (int i = 1; i < NCOLS; ++i) {
  for (int j = 1; j < NROWS; ++j) {
    a[i][j] = max(a[i-1][j-1] + b[i][j],
                  a[i-1][j],
                  a[i][j-1]);
  }
}

Outer iterations are dependent, too
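For reference, a runnable C++ version of the recurrence (container types filled in for illustration) makes the loop-carried dependence explicit — cell (i,j) needs (i-1,j-1), (i-1,j), and (i,j-1):

```cpp
#include <algorithm>
#include <vector>
#include <cstddef>

// The slide's Needleman-Wunsch recurrence. Each cell depends on the
// previous row and the previous cell of the same row; cells along an
// anti-diagonal are independent of one another.
void needleman_wunsch(std::vector<std::vector<int>>& a,
                      const std::vector<std::vector<int>>& b) {
    for (std::size_t i = 1; i < a.size(); ++i)
        for (std::size_t j = 1; j < a[i].size(); ++j)
            a[i][j] = std::max({a[i-1][j-1] + b[i][j],
                                a[i-1][j],
                                a[i][j-1]});
}
```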
[Diagram: the dependence chain through array a[] — each +/max/max cell uses the result of the previous iteration (a[i][j-1]) along with a[i-1][j-1] and a[i-1][j] — yet the region is vectorizable]
Outline
Introduction
DySER: Architecture
Intermediate Representation: Access/Execute PDG
Slicer: Compiler
Evaluation and Results
Conclusion
49
Evaluation Methodology
Simulation framework: gem5 + DySERsim for performance; McPAT for energy
Compiler implementation: leverages the LLVM compilation framework; constructs the AEPDG from LLVM-IR; generates binaries for x86 and SPARC
Benchmarks: throughput workloads (Intel TPT kernels, Parboil benchmark suite); general purpose workloads (SPEC 2006, PARSEC); database (operators, primitives, and a query)
50
Evaluation
What are the performance/energy benefits? DLP workloads; general purpose or irregular workloads
How effective is the compiler?
How effective is it on database query processing? Both DLP and irregular code in the same application
51
DySER vs. Superscalar: DLP
[Chart: speedup (up to ~10×) and energy reduction (%) over the superscalar baseline for CONV, MERGE, NBODY, RADAR, TrSRCH, VR, CUTCP, FFT, KMEANS, LBM, MMM, RI-Q, SPMV, STENCIL, TPACF, NNW, NEEDLE, GM. Annotations: control flow in memory access; multiple configurations, so configuration cost starts to dominate; indirect memory access and loop-carried dependences]
DySER performs on average 3.4x better than baseline with 53% reduction in energy consumption
52
DySER vs. Superscalar: General Purpose
[Chart: speedup (0–20%) and energy reduction (0–30%) for ASTAR, BZIP2, H264, HMMER, LIBQUANTUM, MCF, BLACKSCHOLES, FLUIDANIMATE, FREQMINE, SWAPTIONS, STREAMCLUSTER, GM]
DySER provides 8% mean speedup with 11% reduction in energy consumption
Data dependent branches are mapped into DySER, which leads to fewer pipeline flushes
Exploits the available DLP, but control dependent stores prevent larger gains
53
Where does the efficiency come from?
[Chart: effective IPC — DySER IPC stacked on core IPC vs. baseline IPC — across the throughput kernels]
DySER emulates a wider issue processor than the baseline processor
54
DySER vs. Superscalar: Summary
[Chart: speedup and energy reduction (%) by suite — TPT kernels, Parboil, SPECINT, PARSEC]
On DLP workloads, DySER provides significant improvements
On irregular workloads, DySER provides modest improvements
55
Performance: SSE/AVX Vs. DySER
[Chart: speedup over SSE for SSE, AVX, and DySER across the throughput kernels; one kernel reaches 13×]
DySER bottlenecked by FDIV/FSQRT units
When DLP is readily available, both SIMD and DySER perform well
With control intensive code, DySER performs better
DySER performs on average 1.8x better than SSE/AVX
Why is DySER more efficient than SIMD?
SIMD vectorizes either inside the loop (superword-level parallelism) or across loop iterations (“do across”)
DySER can simultaneously vectorize both
[Diagram: SIMD–SLP vs. SIMD “do across” vs. DySER vectorization shapes]
57
Programmer Optimized vs. Compiler Optimized
[Chart: compiler-generated code relative to programmer-optimized across the throughput kernels; the gaps come from outer loop transformations, a different strategy for reductions, and constant table lookups]
The compiler-generated code’s slowdown is only 30%
58
Why Database?
Energy management is emerging as a primary goal
DySER is an energy efficient in-core accelerator that dynamically specializes frequently executed code
Query processing with database kernels/primitives
59
Simplified TPC-H Query 1

SELECT sum(quantity),
       sum(price * (1-disc)),
       sum(price * (1-disc) * (1+tax)),
       count(*)
FROM lineitem
WHERE ship_date <= XXXX
GROUP BY returnflag, linestatus
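A scalar C++ rendering of this query (with illustrative types — the real TPC-H schema uses decimals and dates) shows the projection, scan, group-by, and aggregation steps that the following slides pick apart:

```cpp
#include <map>
#include <vector>
#include <utility>

// Hypothetical row layout for the simplified query.
struct LineItem {
    double quantity, price, disc, tax;
    int ship_date;
    char returnflag, linestatus;
};
struct Aggregates { double sum_qty = 0, sum_disc = 0, sum_charge = 0; long count = 0; };

std::map<std::pair<char,char>, Aggregates>
query1(const std::vector<LineItem>& table, int cutoff) {
    std::map<std::pair<char,char>, Aggregates> groups;
    for (const auto& r : table) {
        if (r.ship_date > cutoff) continue;               // WHERE ship_date <= XXXX
        auto& g = groups[{r.returnflag, r.linestatus}];   // GROUP BY
        g.sum_qty    += r.quantity;
        g.sum_disc   += r.price * (1 - r.disc);
        g.sum_charge += r.price * (1 - r.disc) * (1 + r.tax);
        g.count      += 1;
    }
    return groups;
}
```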
60
The query decomposes into: projection (highly data parallel), SCAN (highly data parallel), HASH (data parallel with control), and AGGR (limited DLP)
Query Processing Implementations
JIT: the whole query is processed in a single loop; no intermediate results materialized
Vectorized: the query is processed in blocks with columnar data access; intermediate results are materialized; better for SIMD and exploits cache locality
Hybrid: partition the query to utilize the available DLP without materializing many intermediate results
65
Result: TPC-H Query 1
[Chart: speedup of JIT, vectorized, and hybrid implementations on scalar, SIMD, and DySER]
Since little DLP is available, SIMD performs poorly; DySER achieves >2.5× speedup by exploiting pipeline parallelism
Hardware/software codesign improves query processing significantly
66
How about design complexity? We (five graduate students) implemented a prototype of DySER integrated with OpenSPARC
The prototype was mapped onto a Xilinx Virtex 5 FPGA board; it boots unmodified Ubuntu 7.10 Linux, and DySER is not on the critical path!
Design, Integration, and Implementation of the DySER Hardware Accelerator into OpenSPARC, in HPCA 2012
DySER is indeed a non-intrusive design and is easy to integrate into a commercial processor
67
Conclusion
We must rethink and co-design the architecture, micro-architecture, and compilers, making energy a primary constraint; incremental evolution of historical accelerators has produced diminishing returns
Compiler assisted hardware specialization provides energy efficiency without loss of generality, and with low design complexity
68
Publications
[GNS PACT 2013] Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In PACT 2013.
[GHNCSSK IEEE Micro 2012] DySER: Unifying Functionality and Parallelism Specialization for Energy Efficient Computing. IEEE Micro, Sep/Oct 2012.
[BCFGHMNS HotChips 2012] Design, Integration and Implementation of DySER Hardware Accelerator into OpenSPARC. In HotChips 2012.
[BCFHGNS HPCA 2012] Design, Integration and Implementation of DySER Hardware Accelerator into OpenSPARC. In HPCA 2012.
[NSHGDS ISCA 2011] Sampling + DMR: Practical and Low-overhead Permanent Fault Detection. In ISCA 2011.
[GHS HPCA 2011] Dynamically Specialized Datapaths for Energy Efficient Computing. In HPCA 2011.
[GDSVM Micro 2008] Toward A Multicore Architecture for Real-time Ray-tracing. In Micro 2008.
69
Acknowledgements
Prof. Karu Sankaralingam
Marc de Kruijf
Tony Nowatzki
Lena Olson
DySER Team: Chen-Han, Tony, Chris, Ryan, Zach, Jesse
Questions?
71
Backup Slides
DySER Datapath
72
• Ready (R) – for flow control (forward)
• Credit (C) – for flow control (backward)
• Valid (V) – to support control flow
[Diagram: a data word with C, V, and R bits]
Processor Integration: in-order
DySER interface: FIFO
73
[Diagram: DySER attached to the in-order pipeline (Fetch, Decode, Execute, Memory, WriteBack) through FIFOs]
Out-of-Order Integration
Out-of-order core integration:
DySER itself maintains no architectural state
Buffers keep the state for speculative execution
74
Small loops: leverage loop properties — simply unroll the loop further, “cloning” the region (also uses the flexible I/O)
[Diagram: before/after — one execute region vs. two cloned copies sharing the input/output FIFOs]
Large loops: subgraph matching — find identical computations and split them out
Region splitting — configure multiple regions and quickly switch between them
[Diagram: a large region handled by subgraph matching and region virtualization]
77
Results: Configuration Cost
Programs follow the 90/10 rule
[Chart: percentage of code regions contributing 90% of running time for blackscholes, canneal, fluidanimate, streamcluster, bzip2, mcf, h264ref, soplex, sphinx3, and the mean]
78
Energy: SSE/AVX Vs. DySER
[Chart: % energy reduction of SSE vs. DySER across the throughput kernels]
79
Related Work: Architecture
Reconfigurable systems: FPGAs — high software cost
Coarse grain reconfigurable systems:
Beret — uses a pre-designed set of SEBs (Micro 2011)
C-Cores — uses a set of conservation cores to accelerate functions (ASPLOS 2010)
VEAL and CCA — loop accelerators (ISCA 2008)
Other reconfigurable coprocessor approaches: Garp, Tartan, Chimera, etc.
80
Related Work: Beret
An energy efficient coprocessor
No internal control-flow
Uses a set of SEBs (subgraph execution blocks)
81
Related work: Conservation Cores
A set of specialized units that accelerates whole functions.
Slow, no pipelining support
82
Other Publications
Reliability
Sampling + DMR: Practical and Low-overhead Permanent Fault Detection, In ISCA 2011.
Specialized Architecture
Toward A Multicore Architecture for Real-time Ray-tracing, In Micro 2008
83
Database Backup Slides
84
Evaluation Methodology
Implemented optimized versions — Baseline: C (no special operations); SSE: manually optimized with compiler intrinsics; DySER: manually optimized with DySER instructions; AutoDySER: automatically DySERized by the compiler
Evaluated using a gem5 based simulator (x86, out-of-order CPU model)
85
SCAN
Scans a table with equality predicate
High data level parallelism
kernel:
//inputs: in_mask: bitvector, col, key
//output: out_mask: bitvector
for (i = 0; i < LEN; i += SZ):
    for (j = 0; j < SZ; ++j):
        out |= (col[i*SZ+j] == key) << j
    out_mask[i] = in_mask[i] & out
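A runnable C++ version of this kernel (one detail differs: the slide’s pseudocode never clears `out` between blocks, while this version does):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

constexpr int SZ = 8;   // predicate bits packed per mask byte

// Compare SZ column values against the key, pack the results into a
// bitvector, and mask with the incoming selection vector.
std::vector<uint8_t> scan_eq(const std::vector<int>& col, int key,
                             const std::vector<uint8_t>& in_mask) {
    std::vector<uint8_t> out_mask(in_mask.size(), 0);
    for (std::size_t i = 0; i < in_mask.size(); ++i) {
        uint8_t out = 0;
        for (int j = 0; j < SZ; ++j)
            out |= (col[i * SZ + j] == key) << j;   // predicate bit j
        out_mask[i] = in_mask[i] & out;
    }
    return out_mask;
}
```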
86
SCAN: Results
If DLP is available, both DySER and SIMD perform well
87
Aggregation
Kernel:
for (i = 0; i < LEN; i++):
    key = k[i]
    A[key] += V[i]
Indirect memory access
Represents the worst case for DySER
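As runnable C++, the kernel is an indirect and possibly aliasing update — `A[k[i]]` may repeat across iterations — which is what makes it hard to unroll:

```cpp
#include <vector>
#include <cstddef>

// Grouped aggregation: scatter-add through an index column.
// k[i] == k[i+1] is legal, so consecutive iterations may alias.
void aggregate(std::vector<double>& A,
               const std::vector<int>& k,
               const std::vector<double>& V) {
    for (std::size_t i = 0; i < k.size(); ++i)
        A[k[i]] += V[i];    // indirect store; cannot be blindly unrolled
}
```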
88
Why is it hard for DySER?
Mostly address calculation; the computation is just one instruction
Aliasing prevents loop unrolling
[Diagram: the kernel’s dataflow — loads of k[i], A[key], and V[i], two adds, and the store back to A[key]]
91
Why is it hard for DySER? Between iterations i and i+1 there is a may-dependence between the store to A[key] and the next iteration’s load
[Diagram: two copies of the dataflow connected by a may-dependence edge]
92
Solution: alias checking in DySER — compare the two keys (==?) inside the datapath
[Diagram: the unrolled dataflow with an equality check between the two addresses]
93
Aggregation Results
With an out-of-order processor, DySER provides speedup; with an in-order processor, it performs poorly.
94
Database Kernels
DB Kernels with data-level parallelism SCAN SORT
DB Kernels with DLP and control SCAN on RLE HASH STRCMP (Variable length)
Data-level parallelism not readily available Aggregation
95
Results
X86 Inorder
96
Overview
Database Kernels Characterization and Evaluation
Codesigning DySER/DB
Conclusion
97
Codesigning DySER/DB
DySER’s effectiveness drops when memory operations dominate — integrate loads and stores with DySER: the Memory Access Dataflow (MAD) architecture
Is vectorized query processing a problem for DySER? The compute/memory ratio is low; JIT query processing may let DySER exploit pipeline parallelism better
100
A Simple Query
SELECT price * (1-disc),
       price * (1-disc) * (1+tax)
FROM lineitem
101
Query 1 Implementation

JIT query processing:
Out_1 = price * (1-disc)
Out_2 = Out_1 * (1+tax)

Vectorized query processing (inputs → outputs):
tmp_out1 = (1-disc)
tmp_out2 = (1+tax)
out1 = price * tmp_out1
out2 = out1 * tmp_out2
102
Result: Query 1
When fully data-parallel, both SIMD and DySER perform well
103
Slightly Complex Query (TPC-H Q1)
SELECT sum(quantity),
       sum(price * (1-disc)),
       sum(price * (1-disc) * (1+tax)),
       count(*)
FROM lineitem
WHERE ship_date <= XXXX
GROUP BY returnflag, linestatus
104
The query decomposes into: projection (highly data parallel), SCAN (highly data parallel), HASH (data parallel with control), and AGGR (no DLP)
108
Implementations
JIT: the whole query is processed in a single loop; no intermediate results materialized
Vectorized: the query is processed in blocks with columnar data access; intermediate results are materialized; better for SIMD and exploits cache locality
Hybrid: partition the query to utilize the available DLP without materializing many intermediate results
109
Result: TPC-H Query 1
If no DLP is available, SIMD performs poorly; DySER achieves >2.5× speedup by exploiting pipeline parallelism
110
Database Conclusion
DySER exploits both pipeline parallelism and DLP: when DLP is present, DySER provides >2× speedup, and so does SIMD; DySER can provide speedup even when aliasing or control is present
For kernels with a low computation/memory ratio, integrating LD/ST units with DySER may help, but explicit alias checks and high bandwidth are required
Combining multiple database kernels to exploit pipeline parallelism in DySER improves performance, but requires careful looping strategies to utilize DySER well
111
Support for Irregular Workloads
112
Outline
Introduction
Architecture Changes
Compiler Changes
Evaluation
DySER and Irregular Code: 462.libquantum

for (i = 0; i < reg->size; i++) {
  if (reg->node[i].state & ((MAX_UNSIGNED)1 << control1))
    if (reg->node[i].state & ((MAX_UNSIGNED)1 << control2))
      reg->node[i].state ^= ((MAX_UNSIGNED)1 << target);
}
Loop Invariants
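A runnable C++ rendering of this loop (the register and node types are stand-ins for libquantum’s, which are not shown on the slide):

```cpp
#include <cstdint>
#include <vector>

using MAX_UNSIGNED = uint64_t;   // stand-in for libquantum's typedef
struct QuantumNode { MAX_UNSIGNED state; };
struct QuantumReg  { int size; std::vector<QuantumNode> node; };

// Flip the target bit of every amplitude whose two control bits are set.
// control1/control2/target are loop invariants, which is what the
// dysendinv instruction later exploits.
void toffoli(QuantumReg* reg, int control1, int control2, int target) {
    for (int i = 0; i < reg->size; i++) {
        if (reg->node[i].state & ((MAX_UNSIGNED)1 << control1))
            if (reg->node[i].state & ((MAX_UNSIGNED)1 << control2))
                reg->node[i].state ^= ((MAX_UNSIGNED)1 << target);
    }
}
```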
113
DySER and Irregular Code: 462.libquantum

Scalar code:
...
Loop: load reg->node[i], %r1
      andcc %r1, Ctrl1
      bz NextIter
      andcc %r1, Ctrl2
      bz NextIter
      xor %r1, Tgt, %r2
      st %r2, reg->node[i]
NextIter:
      ...
      b Loop

DySER code:
...
Loop: Dload reg->node[i], p0
      Dsend Ctrl1, p1
      Dsend Ctrl2, p2
      Dsend Tgt, p3
      Drecv p4, valid
      cmp valid, 0
      bz NoStore
      Dstore p5, reg->node[i]
      b Merge
NoStore:
      Drecv p5, dummy
Merge:
      ...
      b Loop

Issues: invariant sends; a branch on DRECV is expensive; receives need to drain even invalid outputs
DySER and Irregular Code: 429.MCF

for ( ; arc < stop_arcs; arc += nr_group) {
  if (arc->ident > BASIC) {
    red_cost = arc->cost - arc->tail->potential + arc->head->potential;
    if ((red_cost < 0 && arc->ident == AT_LOWER) ||
        (red_cost > 0 && arc->ident == AT_UPPER)) {
      basket_size++;
      perm[basket_size]->a = arc;
      perm[basket_size]->cost = red_cost;
      perm[basket_size]->abs_cost = ABS(red_cost);
    }
  }
}

Issue: control dependent memory (the access code is intertwined with control)
DySER ISA for Irregular Workloads
DySER send-invariant instruction: dysendinv <reg>, <port>
DySER invocation-start instruction: dystart
DySER branch instructions: dybz <port>, Label; dybnz <port>, Label
118
DySER Output Interface
[Diagram: invalid data arriving at the output interface is marked as aborted and its value discarded]
Outline
Issues with Irregular code
Simulator Fixes
Architecture Changes
Compiler Changes
Evaluation and Results
121
Compiler Changes
Slicing: do not back-slice through control edges (the DySER branch instruction handles them), which offloads more instructions to DySER
Code generator changes: emit the new DySER instructions; no need to insert dummy receive instructions
122
Outline
Issues with Irregular code
Simulator Fixes
Architecture Changes
Compiler Changes
Evaluation and Results
123
DySER and Irregular code462.libquantum
Scalar code:
...
Loop: load reg->node[i], %r1
      andcc %r1, Ctrl1
      bz NextIter
      andcc %r1, Ctrl2
      bz NextIter
      xor %r1, Tgt, %r2
      st %r2, reg->node[i]
NextIter:
      ...
      b Loop

DySER code:
...
      Dsendinv Ctrl1, p1
      Dsendinv Ctrl2, p2
      Dsendinv Tgt, p3
Loop: Dstart
      Dload reg->node[i], p0
      Dbrz p4, NextIter
      Dstore p5, reg->node[i]
NextIter:
      ...
      b Loop
124
DySER and Irregular Code: 429.MCF
[The MCF loop again, now split into access code and execute code]
Performance Results
[Chart: speedup over the respective scalar version (≈0.85×–1.2×) for bzip2, hmmer, libquantum, mcf, h264ref on in-order, 2-wide OOO, and 4-wide OOO cores]
126
Energy Results
[Chart: energy reduction (0–25%) for the same benchmarks on in-order, 2-wide OOO, and 4-wide OOO cores]
127
128
Future Directions
129
Future Research Directions: How can we make legacy code energy efficient?
Use JIT compilation to target accelerators dynamically — from a specialized IR if source code is available, otherwise from the binary itself
Use binary rewriters to target accelerators statically
Challenges: analysis to identify acceleratable instruction sequences; lightweight analysis for JIT; static analysis of compiled binaries; specialized IR design
130
Future Research Directions: Energy efficient memory hierarchy (EEMH)
Moving data burns most of the energy; filtering data or performing operations in the hierarchy itself will help reduce energy
Challenges — design: how to perform computation efficiently in memory? Programming model: how to program the EEMH? Compiler: what compiler algorithms or transformations are needed for the EEMH?
132
DySER vs. Superscalar: Irregular
[Chart: speedup (0–20%) and energy reduction (%) for ASTAR, GCC, H264, LIBQUANTUM, OMNETPP, SJENG, FLUIDANIMATE, SWAPTIONS]
133
Opportunities in Database Traditional Query Processing:
Vectorized Query Processing:
Traditional: one record flows through SCAN → PROJECT → HASH → AGGREGATE, producing output for one record
Vectorized: multiple records flow through the same operators, producing output for multiple records
134
Database Kernels
DB Kernels with data-level parallelism SCAN SORT
DB Kernels with DLP and control SCAN on RLE HASH STRCMP
Data-level parallelism not readily available Aggregation
135
Database Kernels: Performance
[Chart: speedup (0–7×) of Scalar, SSE, DySER, and AutoDySER on SCAN, SCAN+RLE, SORT, HASH, STRCMP, AGGR, GM. SCAN and SORT are highly data parallel; DySER provides speedup even on data intensive code]