Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Energy-EfficientParallel Computer Architecture

Christopher Batten

Computer Systems LaboratorySchool of Electrical and Computer Engineering

Cornell University

Fall 2014

• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns

Motivating Trends in Computer Architecture

Transistors(Thousands)

Frequency(MHz)

TypicalPower (W)

MIPSR2K

IntelP4

DECAlpha 21264

Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

SPECintPerformance

107

Numberof Cores

Intel 48-CorePrototype

AMD 4-CoreOpteron

Parallelism?

? Trend 1Power & energy

constrain allsystems

Trend 2Transition to

multicoreprocessors

Trend 3Inevitable endof Moore’s law

Cornell University Christopher Batten 2 / 19


Performance (Tasks per Second)

En

erg

y E

ffic

ien

cy (

Ta

sks p

er

Joule

)

SimpleProcessor

Design PowerConstraint

High-PerformanceArchitectures

EmbeddedArchitectures

DesignPerformanceConstraint

Flexibility vs

. Spe

cializat

ion

CustomASIC

Less FlexibleAccelerator

More FlexibleAccelerator



Fine-Grain Heterogeneous Architectures

L2 Cache Bank

On-Chip Router

L2 Cache Bank

On-Chip Router

Latency-Optimized Tile: few large out-of-order cores

Data Memory

Instruction Memory

CPLD

ST ST

L2 Cache Bank

On-Chip Router

L2 Cache Bank

On-Chip Router

L1I$

GPPL1 I$

L1 D$

OoO GPP

L1 I$

L1 D$

OoO GPPL1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

μT0

μT4

μT1

μT5

L1 I$

CP SIMT Issue Unit

SIMT Memory Unit

L1 D$

μT2

μT6

μT3

μT7

Throughput-Optimized Tile: many small in-order cores

Data-Parallel-Optimized Tile:flexible data-parallel accelerator

Application-Optimized Tile:instruction set specialization;flexible algorithm ordata-structure accelerators



Projects Within the Batten Research Group

GPGPUMicroarchitecture

XLOOPS: Specializationfor Loop Patterns

XPC: Explicit-Parallel-Call Architectures

instrA

instrB

pcall n, func1

...

psync func1:

instrC

pcall func2

...

pret

CtrlProc

L1 Data Cache

SIMT Lanes

SIMT Issue Unit

SIMT Mem Unit

Polymorphic HardwareSpecialization

L1I$

PolyASU

PolyASU

PolyDSU

PolyDSU

PolyDSU

Control Crossbar

Memory Crossbar

CP

L1 Data Cache

Reconfigurable PowerDistribution Networks

Control

Complexity EffectiveOn-Chip Networks

Hardware Accelerationfor Dynamic Languages

Scheme Python

JIT Compiler

Python-BasedHardware Modeling

class Reg (Model):

def __init__( s ):

s.in = InPort(32)

s.out = OutPort(32)

def elaborate( s ):

@s.posedge_clk

def seq_logic():

s.out.next = s.in



Vertically Integrated Research Methodology

Our research involves reconsidering all aspects of the computing stackincluding applications, programming frameworks, compiler optimizations,runtime systems, instruction set design, microarchitecture design, VLSI

implementation, and hardware design methodologies

CrossCompiler

FunctionalSimulator

Binary

C++ Applications

C++ FunctionalSpecification

Cycle-LevelSimulator

C++ Cycle-LevelModel

Layout

Verilog RTL Model

RTLSimulator

Gate-Level Model

Gate-LevelSimulator

Switching Activity

PowerAnalysis

SynthesisPlace&Route

Key Metrics: Cycle Count, Cycle Time, Area, Energy

Experimenting with full-chiplayout, FPGA prototypes, andtest chips is a key part of our

research methodology


Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS: Architectural Specialization forInter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu,Zhiru Zhang, and Christopher Batten

47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO)Cambridge, UK, Dec. 2014



Inter-Iteration Loop Dependence Patterns

inst0inst1inst2inst3...branch

Iteration0


Iteration1


Iteration2


Iteration3


Iterationn-1

Explicit loop specialization (XLOOPS) elegantly encodes inter-iterationloop dependence patterns in the instruction set and enables

traditional, specialized, and adaptive execution.



XLOOPS ISA: Unordered Concurrent

for ( i=0; i<N; i++ )

C[i] = A[i] * B[i]

loop:

lw r2, 0(rA)

lw r3, 0(rB)

mul r4, r2, r3

sw r4, 0(rC)

addiu.xi rA, 4

addiu.xi rB, 4

addiu.xi rC, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.uc

Iteration 1


Iteration 2


Iteration 3




XLOOPS ISA: Ordered-Through-Registers

for ( X=0, i=0; i<N; i++ )

X += A[i]

B[i] = X

loop:

lw r2, 0(rA)

addu rX, r2, rX

sw rX, 0(rB)

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.or r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.or

Iteration 1


Iteration 2


Iteration 3




XLOOPS ISA: Ordered-Through-Memory

for ( i=K; i<N; i++ )

A[i] = A[i] * A[i-K]

# r1 = rK

# r3 = rA + 4*rK

loop:

lw r4, 0(r3)

lw r5, 0(rA)

mul r6, r4, r5

sw r6, 0(r3)

addiu.xi r3, 4

addiu.xi rA, 4

addiu r1, r1, 1

xloop.om r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.om

Iteration 1


Iteration 2


Iteration 3




XLOOPS ISA: Unordered Atomic

for ( i=0; i<N; i++ )

B[ A[i] ]++

D[ C[i] ]++

loop:

lw r6, 0(rA)

lw r7, 0(r6)

addiu r7, r7, 1

sw r7, 0(r6)

addiu.xi rA, rA, 4

...

addiu r1, r1, 1

xloop.ua r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.ua

Iteration 1


Iteration 2


Iteration 3




XLOOPS ISA: Dynamic Bound

Iteration 0

inst0inst1inst2inst3...xloop.db

Iteration 1


Iteration 2


Iteration 3


Iteration 4


Iteration 5




XLOOPS Compiler

Kernel implementing Floyd-Warshall shortest path algorithm

for ( int k = 0; k < n; k++ )

#pragma xloops ordered

for ( int i = 0; i < n; i++ )

#pragma xloops unordered

for ( int j = 0; j < n; j++ )

path[i][j] = min( path[i][j], path[i][k] + path[k][j] );

Kernel implementing greedy algorithm for maximal matching

#pragma xloops ordered

for ( int i = 0; i < num_edges; i++ )

int v = edges[i].v; int u = edges[i].u;

if ( vertices[v] < 0 && vertices[u] < 0 )

vertices[v] = u; vertices[u] = v; out[k++] = i;



XLOOPS Microarchitecture

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Lane3

Lane1

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane0

SLFU SLFU SLFU

IDQ

Lane Management Unit

IDQ IDQ

CIB 8×CIB 8×CIB 8×

LSQ16×

LSQ16×

LSQ16×

DBN

Lane Management Unit I Elegant hardware/softwareabstraction enables efficienttraditional execution ongeneral-purpose processor

I Loop-pattern specialization unitenables specialized execution forimproved performance andefficiency

I Adaptive execution uses twoprofiling phases to adaptivelydetermine best location to runloop



XLOOPS Cycle-Level Performance Results

rgb2

cmyk

-uc

sgem

m-u

c

ssea

rch-

uc

sym

m-u

c

vite

rbi-u

c

war

-uc

adpc

m-o

r

cova

r-or

dith

er-o

r

kmea

ns-o

r

sha-

or

sym

m-o

r

dynp

rog-

om

knn-

om

ksac

k-sm

-om

ksac

k-lg-o

m

war

-om

mm

-orm

sten

cil-o

rm

btre

e-ua

hsor

t-ua

huffm

an-u

a

rsor

t-ua

bfs-

uc-d

b

qsor

t-uc-

db

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Sp

ee

du

p N

orm

aliz

ed

to

S

imp

le R

ISC

Co

re

OOO 2-Way OOO 4-Way OOO 2-Way with LPSU

xloop.uc xloop.or xloop.{om,orm} xloop.ua xloop.*.db

I XLOOPS vs. Simple Core : Always higher performanceI XLOOPS vs. OOO 2-way : Generally competitive or higher performanceI XLOOPS vs. OOO 4-way : Mixed results



XLOOPS Cycle-Level Energy-Efficiency Results

XLOOPS vs.Simple RISC Core

XLOOPS vs.OOO 2-Way

XLOOPS vs.OOO 4-Way

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0Normalized Performance

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Norm

aliz

ed E

nerg

y E

ffic

iency

0.5 1.0 1.5 2.0 2.5 3.0Normalized Performance

0.5 1.0 1.5 2.0 2.5Normalized Performance

I XLOOPS vs. Simple Core : Competitive energy efficiencyI XLOOPS vs. OOO 2-way : Always more energy efficientI XLOOPS vs. OOO 4-way : Always more energy efficient



DCache16KB SRAM for Cache Lines

DCacheTags

ICacheTags

ICache16KB SRAM for Cache Lines

L0Instr

Buffer

L0Instr

Buffer

L0Instr

Buffer

L0Instr

Buffer

Loop PatternSpecialization Unit

ScalarProcessor

32b IEEEFloating Point Unit

32b IntegerMul/Div Unit

VLSIImplementation

I TSMC 40 nmstandard-cell-basedimplementation

I RISC scalarprocessor with4-lane LPSU

I Supports xloop.uc

I Implemented inTSMC 40 nm usingASIC CAD toolflow

I ≈40% extra areacompared to simpleRISC processor


Batten Research Group Overview XLOOPS: Specialization for Loop Patterns

Take-Away Points

I We envision future computing systems usingfine-grain heterogeneous architectures with a mixof tiles optimized for latency, throughput, classesof applications, or even a specific application

I We are using a vertically integrated researchmethodology to explore a variety of projects thatwill hopefully enable such fine-grainheterogeneous architectures


Documents

Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures