19
Energy-Efficient Parallel Computer Architecture Christopher Batten Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University Fall 2014

Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Energy-EfficientParallel Computer Architecture

Christopher Batten

Computer Systems LaboratorySchool of Electrical and Computer Engineering

Cornell University

Fall 2014

Page 2: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns

Motivating Trends in Computer Architecture

Transistors(Thousands)

Frequency(MHz)

TypicalPower (W)

MIPSR2K

IntelP4

DECAlpha 21264

Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

SPECintPerformance

107

Numberof Cores

Intel 48-CorePrototype

AMD 4-CoreOpteron

Parallelism?

? Trend 1Power & energy

constrain allsystems

Trend 2Transition to

multicoreprocessors

Trend 3Inevitable endof Moore’s law

Cornell University Christopher Batten 2 / 19

Page 3: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns

Performance (Tasks per Second)

En

erg

y E

ffic

ien

cy (

Ta

sks p

er

Joule

)

SimpleProcessor

Design PowerConstraint

High-PerformanceArchitectures

EmbeddedArchitectures

DesignPerformanceConstraint

Flexibility vs

. Spe

cializat

ion

CustomASIC

Less FlexibleAccelerator

More FlexibleAccelerator

Cornell University Christopher Batten 3 / 19

Page 4: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns

Fine-Grain Heterogeneous Architectures

L2 Cache Bank

On-Chip Router

L2 Cache Bank

On-Chip Router

Latency-Optimized Tile: few large out-of-order cores

Data Memory

Instruction Memory

CPLD

ST ST

L2 Cache Bank

On-Chip Router

L2 Cache Bank

On-Chip Router

L1I$

GPPL1 I$

L1 D$

OoO GPP

L1 I$

L1 D$

OoO GPPL1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

L1I$

GPP

L1D$

μT0

μT4

μT1

μT5

L1 I$

CP SIMT Issue Unit

SIMT Memory Unit

L1 D$

μT2

μT6

μT3

μT7

Throughput-Optimized Tile: many small in-order cores

Data-Parallel-Optimized Tile:flexible data-parallel accelerator

Application-Optimized Tile:instruction set specialization;flexible algorithm ordata-structure accelerators

Cornell University Christopher Batten 4 / 19

Page 5: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns

Projects Within the Batten Research Group

GPGPUMicroarchitecture

XLOOPS: Specializationfor Loop Patterns

XPC: Explicit-Parallel-Call Architectures

instrA

instrB

pcall n, func1

...

psync func1:

instrC

pcall func2

...

pret

CtrlProc

L1 Data Cache

SIMT Lanes

SIMT Issue Unit

SIMT Mem Unit

Polymorphic HardwareSpecialization

L1I$

PolyASU

PolyASU

PolyDSU

PolyDSU

PolyDSU

Control Crossbar

Memory Crossbar

CP

L1 Data Cache

Reconfigurable PowerDistribution Networks

Control

Complexity EffectiveOn-Chip Networks

Hardware Accelerationfor Dynamic Languages

Scheme Python

JIT Compiler

Python-BasedHardware Modeling

class Reg (Model):

def __init__( s ):

s.in = InPort(32)

s.out = OutPort(32)

def elaborate( s ):

@s.posedge_clk

def seq_logic():

s.out.next = s.in

Cornell University Christopher Batten 5 / 19

Page 6: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

• Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns

Vertically Integrated Research Methodology

Our research involves reconsidering all aspects of the computing stackincluding applications, programming frameworks, compiler optimizations,runtime systems, instruction set design, microarchitecture design, VLSI

implementation, and hardware design methodologies

CrossCompiler

FunctionalSimulator

Binary

C++ Applications

C++ FunctionalSpecification

Cycle-LevelSimulator

C++ Cycle-LevelModel

Layout

Verilog RTL Model

RTLSimulator

Gate-Level Model

Gate-LevelSimulator

Switching Activity

PowerAnalysis

SynthesisPlace&Route

Key Metrics: Cycle Count, Cycle Time, Area, Energy

Experimenting with full-chiplayout, FPGA prototypes, andtest chips is a key part of our

research methodology

Cornell University Christopher Batten 6 / 19

Page 7: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS: Architectural Specialization forInter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu,Zhiru Zhang, and Christopher Batten

47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO)Cambridge, UK, Dec. 2014

Cornell University Christopher Batten 7 / 19

Page 8: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

Inter-Iteration Loop Dependence Patterns

inst0inst1inst2inst3...branch

Iteration0

inst0inst1inst2inst3...branch

Iteration1

inst0inst1inst2inst3...branch

Iteration2

inst0inst1inst2inst3...branch

Iteration3

inst0inst1inst2inst3...branch

Iterationn-1

Explicit loop specialization (XLOOPS) elegantly encodes inter-iterationloop dependence patterns in the instruction set and enables

traditional, specialized, and adaptive execution.

Cornell University Christopher Batten 8 / 19

Page 9: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS ISA: Unordered Concurrent

for ( i=0; i<N; i++ )

C[i] = A[i] * B[i]

loop:

lw r2, 0(rA)

lw r3, 0(rB)

mul r4, r2, r3

sw r4, 0(rC)

addiu.xi rA, 4

addiu.xi rB, 4

addiu.xi rC, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.uc

Iteration 1

inst0inst1inst2inst3...xloop.uc

Iteration 2

inst0inst1inst2inst3...xloop.uc

Iteration 3

inst0inst1inst2inst3...xloop.uc

Cornell University Christopher Batten 9 / 19

Page 10: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS ISA: Ordered-Through-Registers

for ( X=0, i=0; i<N; i++ )

X += A[i]

B[i] = X

loop:

lw r2, 0(rA)

addu rX, r2, rX

sw rX, 0(rB)

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.or r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.or

Iteration 1

inst0inst1inst2inst3...xloop.or

Iteration 2

inst0inst1inst2inst3...xloop.or

Iteration 3

inst0inst1inst2inst3...xloop.or

Cornell University Christopher Batten 10 / 19

Page 11: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS ISA: Ordered-Through-Memory

for ( i=K; i<N; i++ )

A[i] = A[i] * A[i-K]

# r1 = rK

# r3 = rA + 4*rK

loop:

lw r4, 0(r3)

lw r5, 0(rA)

mul r6, r4, r5

sw r6, 0(r3)

addiu.xi r3, 4

addiu.xi rA, 4

addiu r1, r1, 1

xloop.om r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.om

Iteration 1

inst0inst1inst2inst3...xloop.om

Iteration 2

inst0inst1inst2inst3...xloop.om

Iteration 3

inst0inst1inst2inst3...xloop.om

Cornell University Christopher Batten 11 / 19

Page 12: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS ISA: Unordered Atomic

for ( i=0; i<N; i++ )

B[ A[i] ]++

D[ C[i] ]++

loop:

lw r6, 0(rA)

lw r7, 0(r6)

addiu r7, r7, 1

sw r7, 0(r6)

addiu.xi rA, rA, 4

...

addiu r1, r1, 1

xloop.ua r1, rN, loop

Iteration 0

inst0inst1inst2inst3...xloop.ua

Iteration 1

inst0inst1inst2inst3...xloop.ua

Iteration 2

inst0inst1inst2inst3...xloop.ua

Iteration 3

inst0inst1inst2inst3...xloop.ua

Cornell University Christopher Batten 12 / 19

Page 13: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS ISA: Dynamic Bound

Iteration 0

inst0inst1inst2inst3...xloop.db

Iteration 1

inst0inst1inst2inst3...xloop.db

Iteration 2

inst0inst1inst2inst3...xloop.db

Iteration 3

inst0inst1inst2inst3...xloop.db

Iteration 4

inst0inst1inst2inst3...xloop.db

Iteration 5

inst0inst1inst2inst3...xloop.db

Cornell University Christopher Batten 13 / 19

Page 14: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS Compiler

Kernel implementing Floyd-Warshall shortest path algorithm

for ( int k = 0; k < n; k++ )

#pragma xloops ordered

for ( int i = 0; i < n; i++ )

#pragma xloops unordered

for ( int j = 0; j < n; j++ )

path[i][j] = min( path[i][j], path[i][k] + path[k][j] );

Kernel implementing greedy algorithm for maximal matching

#pragma xloops ordered

for ( int i = 0; i < num_edges; i++ )

int v = edges[i].v; int u = edges[i].u;

if ( vertices[v] < 0 && vertices[u] < 0 )

vertices[v] = u; vertices[u] = v; out[k++] = i;

Cornell University Christopher Batten 14 / 19

Page 15: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS Microarchitecture

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Lane3

Lane1

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane0

SLFU SLFU SLFU

IDQ

Lane Management Unit

IDQ IDQ

CIB 8×CIB 8×CIB 8×

LSQ16×

LSQ16×

LSQ16×

DBN

Lane Management Unit I Elegant hardware/softwareabstraction enables efficienttraditional execution ongeneral-purpose processor

I Loop-pattern specialization unitenables specialized execution forimproved performance andefficiency

I Adaptive execution uses twoprofiling phases to adaptivelydetermine best location to runloop

Cornell University Christopher Batten 15 / 19

Page 16: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS Cycle-Level Performance Results

rgb2

cmyk

-uc

sgem

m-u

c

ssea

rch-

uc

sym

m-u

c

vite

rbi-u

c

war

-uc

adpc

m-o

r

cova

r-or

dith

er-o

r

kmea

ns-o

r

sha-

or

sym

m-o

r

dynp

rog-

om

knn-

om

ksac

k-sm

-om

ksac

k-lg-o

m

war

-om

mm

-orm

sten

cil-o

rm

btre

e-ua

hsor

t-ua

huffm

an-u

a

rsor

t-ua

bfs-

uc-d

b

qsor

t-uc-

db

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Sp

ee

du

p N

orm

aliz

ed

to

S

imp

le R

ISC

Co

re

OOO 2-Way OOO 4-Way OOO 2-Way with LPSU

xloop.uc xloop.or xloop.{om,orm} xloop.ua xloop.*.db

I XLOOPS vs. Simple Core : Always higher performanceI XLOOPS vs. OOO 2-way : Generally competitive or higher performanceI XLOOPS vs. OOO 4-way : Mixed results

Cornell University Christopher Batten 16 / 19

Page 17: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

XLOOPS Cycle-Level Energy-Efficiency Results

XLOOPS vs.Simple RISC Core

XLOOPS vs.OOO 2-Way

XLOOPS vs.OOO 4-Way

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0Normalized Performance

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Norm

aliz

ed E

nerg

y E

ffic

iency

0.5 1.0 1.5 2.0 2.5 3.0Normalized Performance

0.5 1.0 1.5 2.0 2.5Normalized Performance

I XLOOPS vs. Simple Core : Competitive energy efficiencyI XLOOPS vs. OOO 2-way : Always more energy efficientI XLOOPS vs. OOO 4-way : Always more energy efficient

Cornell University Christopher Batten 17 / 19

Page 18: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns •

DCache16KB SRAM for Cache Lines

DCacheTags

ICacheTags

ICache16KB SRAM for Cache Lines

L0Instr

Buffer

L0Instr

Buffer

L0Instr

Buffer

L0Instr

Buffer

Loop PatternSpecialization Unit

ScalarProcessor

32b IEEEFloating Point Unit

32b IntegerMul/Div Unit

VLSIImplementation

I TSMC 40 nmstandard-cell-basedimplementation

I RISC scalarprocessor with4-lane LPSU

I Supports xloop.uc

I Implemented inTSMC 40 nm usingASIC CAD toolflow

I ≈40% extra areacompared to simpleRISC processor

Cornell University Christopher Batten 18 / 19

Page 19: Christopher Batten - Cornell Universitycbatten/pdfs/batten-xloops-afrl2014.pdfGPGPU Microarchitecture XLOOPS: Specialization for Loop Patterns XPC: Explicit-Parallel-Call Architectures

Batten Research Group Overview XLOOPS: Specialization for Loop Patterns

Take-Away Points

I We envision future computing systems usingfine-grain heterogeneous architectures with a mixof tiles optimized for latency, throughput, classesof applications, or even a specific application

I We are using a vertically integrated researchmethodology to explore a variety of projects thatwill hopefully enable such fine-grainheterogeneous architectures

Cornell University Christopher Batten 19 / 19