34
Floa%ng Point Architecture Extensions for Op%mized Matrix Factoriza%on Ardavan Pedram Robert van de Geijn Andreas Gerstlauer

Floa%ng(Point(Architecture(Extensions( …2,2) PE (2,3) PE (3,0) PE (3,1) PE (3,2) PE (3,3) ` MEM B Address Regs Row Bus Write Column Bus Write A B µ programmed Controller Column

Embed Size (px)

Citation preview

Floa%ng(Point(Architecture(Extensions(

for(Op%mized(Matrix(Factoriza%on(

(

Ardavan(Pedram(

Robert(van(de(Geijn(

Andreas(Gerstlauer(

(P

OPTIMMUS

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results(

•  Conclusion(and(Future(Work(

4/12/13( 2(

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results((

•  Conclusion(and(Future(Work(

4/12/13( 3(

Floa%ng(Point(and(Numerical(Stability(

•  Floa%ngOpoint(Representa%on(is(Limited((

•  Numerical(Algorithms(on(CPUs(

–  Ensure(numerical(stability(

–  Avoid(overflow(and(underflow(–  Introduce(extra(complexity(on(the(

cri%cal(path(of(algorithm(

–  Sacrifice(performance(for(numerical(

stability(

•  Extra(opera%ons(

4/12/13( 4(

Codesign(is(a(Solu%on(

•  Our(Goal((Three(matrix(factoriza%ons()(

– Provide(hardware(support(– Reduce(complexity(in(the(algorithm(

(

•  Solu%on(– Provide(extensions(to(the(floa%ngOpoint(unit(– Simplify(the(algorithm(exploi%ng(extended(

architecture(

4/12/13( 5(

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results(

•  Conclusion(and(Future(Work(

4/12/13( 6(

Related(work(

•  Hybrid(Solu%ons(–  [Volkov’08](((

•  Cholesky,(LU,(and(QR(–  [Agullo’11]((

•  Mul%Ocore(Mul%OGPU(

•  QR(

•  En%rely(on(GPU(–  [Kerr’09]((

•  Householder(QR(–  [Galoppo’05]((

•  LU(

•  FPGAs(–  [Wu’09]((

•  LU(without(pivo%ng(–  LAPACKrc([Gonzalez’09](

•  Customized(solvers(

–  [Tai’11](•  Scalable(QR(

–  [Aslan’09](•  Householder(QR(•  Unified(unit(

4/12/13( 7(

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results(

•  Conclusion(and(Future(Work(

4/12/13( 8(

The(Linear(Algebra(Core((LAC)([Pedram’11](

•  Scalable(2OD(Array(of(nr×nr(Processing(Elements((PEs)(

–  Up(to(50(GFLOPS/W(efficiency(@(1GHz(

–  Specialized(floa%ngOpoint(units(with(1(MAC/cycle(throughput([Vangal’06](

–  Broadcast(buses((no(need(to(pipeline(up(to(nr=16)(–  Distributed(memory(architecture(

–  Distributed,(PEOlocal(microOprogrammed(control(4/12/13( 9(

PE(0,0)

PE(0,1)

PE(0,2)

PE(0,3)

PE(1,0)

PE(1,1)

PE(1,2)

PE(1,3)

PE(2,0)

PE(2,1)

PE(2,2)

PE(2,3)

PE(3,0)

PE(3,1)

PE(3,2)

PE(3,3)

`

MEM B

Address Regs

Row Bus Write

Column Bus Write

A B

µ programmed Controller

Column Bus Read

Row Bus Read

MACAccumulator

Cin

Memory Interface

RFMEM A

`

MEM B

Address Regs

Column Bus Write

A B

µ programmed Controller

MACAccumulator

Cin

RFMEM A

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results((

•  Conclusion(and(Future(Work(

4/12/13( 10(

Matrix(Factoriza%ons(

•  LU(Factoriza%on(with(Par%al(Pivo%ng(–  Gaussian(elimina%on((

–  Par%al(pivo%ng(due(to(finite(precision(of(floa%ngOpoint(arithme%c(•  avoid(element(growth((

•  avoid(catastrophic(cancela%on(•  avoid(divide(by(zero((

•  Cholesky(Factoriza%on(–  Special(case(of(LU(for(Symmetric(Posi%ve(Defini%ve((SPD)(Matrices(

–  Half(the(FLOPS(of(LU(

•  QR(Factoriza%on((Householder(Reflec%ons)(–  Most(stable(solu%on(

–  Twice(the(FLOPS(of(LU((

4/12/13( 11(

LU(with(Par%al(Pivo%ng((

•  A = LU –  Square A�Rn×n –  Unit triangular L�Rn×n –  Upper triangular U�Rn×n

•  Solution by recursively partitioning

•  Rearrange((pivot)(the(rows(of(the(matrix(as(the(computa%on(

unfolds((

(4/12/13( 12(

LU Factorization

Update Source 1

GEMMUpdate

TRSM Update

Computed L

Not Yet Computed

A22

A00

A11

A20

A02

A12A10

A01

A21

(1) A11:=A11-A10A01 (2) A21:=A21-A20A01(3)

A11

(6) A12:=A12TRIL(A11)-1

L11

A21

A11Update

Source 2 A10

A20 A21

A01

A11

A12

A11

A12:=LU, p1

A22

A00

A20

A02

A12A10

A01

A21 A22

A00

A20

A02

A12A10

A01

A21

L11

U11

U00

(5) A12:=A12-A10A02

Interchange Rows

A11

A12

A11

A12:=P(p1)

A11

A12

A11

A12(4)

A00 A02

Computed U

A00

A22

A12

p

p0

p1

p2

A01 A02

A20

A10

A21 A22

A10

A20

A00 A02

A22

A12 A11

p

p0

p1

p2

LU(with(Par%al(Pivo%ng((

4/12/13( 13(

A =

✓↵11 aT12a21 A22

L =

✓1 0l21 L22

◆U =

✓�11 uT

12

0 U22

↵11 = �11 aT12 = uT12

a21 = �11l21 A22 � l21uT12 = L22U22

↵11 := �11 (no-op) aT12 := uT12 (no-op)

a21 := l21 = a21/�11 A22 := LU(A22 � l21uT12)

Par%%on(A,(L,&and&U:(

α11(and(ν11(are(scalars(and(A=LU&means:(✓

↵11 aT12a21 A22

◆=

✓�11 uT

12

l21�11 L22U22 + l21uT12

Which in turn means:

We can thus compute L and U in place for matrix A:

LU(with(Par%al(Pivo%ng(on(LAC((1)(

4/12/13( 14(

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find the Pivot Feed 1/x Broadcast Interchange Rows

(1) (4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum produced value in ith

column

Rank-1 update

(2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row

(3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x1/x

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

(S1) (S4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum in ith column Rank-1 update (S2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row (S3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x 1/x

(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)

4/12/13( 15(

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find the Pivot Feed 1/x Broadcast Interchange Rows

(1) (4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum produced value in ith

column

Rank-1 update

(2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row

(3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x1/x

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

(S1) (S4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum in ith column Rank-1 update (S2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row (S3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x 1/x

(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)

LU(with(Par%al(Pivo%ng(on(LAC((2)(

4/12/13( 16(

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find the Pivot Feed 1/x Broadcast Interchange Rows

(1) (4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum produced value in ith

column

Rank-1 update

(2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row

(3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x1/x

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

(S1) (S4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum in ith column Rank-1 update (S2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row (S3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x 1/x

(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)

LU(with(Par%al(Pivo%ng(on(LAC((3)(

4/12/13( 17(

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find the Pivot Feed 1/x Broadcast Interchange Rows

(1) (4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum produced value in ith

column

Rank-1 update

(2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row

(3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x1/x

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

(S1) (S4)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Find maximum in ith column Rank-1 update (S2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Interchange the pivot row with ith row (S3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Scale the ith column with pivot

1/x 1/x

(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)

LU(with(Par%al(Pivo%ng(on(LAC((4)(

4/12/13( 18(

•  Pivoting on the Panel to Avoid –  Division by zeros on diagonal –  Element growth

•  Sequential Algorithm –  Except for GEMM updates

•  Special Demands –  Reciprocal (1/X) unit –  Find maximum element of a vector

LU(with(Par%al(Pivo%ng(on(LAC(

Cholesky(on(LAC(

4/12/13( 19(

Sub

LAPACK

Sub

BLASPacking

Host OS

LAPHost Processor

BLAS

LAPACK

LA Library

Host Application

Assembly Code Device Driver

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

1/x

(S1)

Broadcast

MultFeed 1/x

Broadcast

Inverse

Broadcast

TRSM Coef

(S2) (S3) (S3)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)(1)

(2)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Broadcast a1 ! Broadcast a1

Broadcast a0T

Rank-1 update C+=a0a0T

(S1)

(S1)

(S1)

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

1/!

(b)

c

PE(0,0) PE(0,1) PE(0,2) PE(0,3)

PE(1,0) PE(1,1) PE(1,2) PE(1,3)

PE(2,0) PE(2,1) PE(2,2) PE(2,3)

PE(3,0) PE(3,1) PE(3,2) PE(3,3)

1/!x

(S1)

Broadcast

MultFeed 1/!x

Broadcast

Inverse Sqrt

(S2) (S3)

•  A = LLT –  Symmetric Positive Definite (SPD)

matrix A�Rn×n

–  Lower triangular matrix L�Rn×n

•  Sequential Algorithm –  Except for GEMM updates

•  Simple Control

–  7-stage state machine in PEs

•  Special Demands –  Inverse Square-root unit –  Square root unit

Householder(QR(

•  A = QR (Householder) –  Square matrix A�Rn×n –  Orthogonal matrix Q�Rn×n –  Upper triangular matrix R�Rn×n

•  Householder Transformation –  Vector-Norm is the key

operation

•  Special Demands –  Normalization

4/12/13( 20(

x = �0, · · · ,�n�1

y = x/t = 1/t ⇤ (�0, · · · ,�n�1);

kxk2 := t⇥ kyk2

kxk2 =q

20 + �

22 + . . .+ �

2n�1

t = max |xi| ;

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results(

•  Conclusion(and(Future(Work(

4/12/13( 21(

Summary(of(Extensions(

•  Cholesky(–  Inverse(squareOroot((–  SquareOroot((

(

•  LU(with(Par%al(Pivo%ng(– Maximum(finder(

–  Reciprocal((

•  QR((VectorONorm)(

–  Op%on(1:(•  Maximum(finder(

•  Reciprocal((

–  Op%on(2:((•  Extend(the(floa%ngOpoint(representa%on(

•  Add(one(extra(bit(to(the(exponent(

4/12/13( 22(

Modifica%ons(to(the(LAC(

•  Special(Func%on(Unit((SFU)(

–  Inverse(squareOroot(–  SquareOroot(–  Reciprocal((1/X)(

•  FP(MAC((

– Maximum(Finder(

–  Exponent(Extension(

4/12/13( 23(

L2 Cache

(256 KB)

OOO exec logic

Branch predictor

Fetch/Decode

L1 Instruction Cache

(16 KB)

L1 Data Cache

(16 KB)

L3 Cache

(8 MB)

shared across

cores

Data Path

FU FU !!.. FU

Register

File

Load/

Store

Unit

SIMD ALU

Texture Cache

(read-only)

Fetch/Decode

Shared Local

storage

/

L1 Cache

(64 KB)

L2 Cache

(~1 MB)

Execution Contexts

(128 KB)

ALU1 ALU2 ALU3 ALU4

ALU5 ALU6 ALU7 ALU8

Fetch/Decode

Instruction

Cache

(8 KB)

Data Cache

(4 KB)

Multi-threaded

Instruction Issue

SIMD Poly Execution Unit

PE1

ALU

Mono Execution Unit

PE2 PE96

AL

U

FP

Ad

d

FP

Mu

l

Div

, "

MA

C1

6

Register File

(128 Bytes)

SRAM

(6 KB)

I/O

LAC

PE1 FP MAC

SRAM

(16 KB)

PE2 PE3 PE4

PE5 PE6 PE7 PE8

PE9 PE10 PE11 PE12

PE13 PE14 PE15 PE16

Controller

SFU

One Bank of On-Chip Memory

(512 KB)

Dedicated to Core

On-Chip Memory

(Less than 512 KB)

Shared Between Cores

Fetch/Decode

L1 Cache

(32 KB)

L2 Cache

(512 KB)

SPE1

PowerPC Core

SPE2 SPE8

Register File

(2 MB)

Local Store

(256 KB)

DMA

Fetch/Decode

FU FU Load/Store

UnitFU FU

SIMD ALU

Branch Predictor

SIMD ALU/FPU

Load/Store

Unit

Register File

25 GB/sec

To memory

25 GB/sec

To IO

Interconnect Bus 204 GB/sec

(a) (b)

(b)(a)

(

(

(

(

Special(Func%on(Unit((SFU)([Piñeiro’02](

•  Supported(opera%ons(–  Division(–  Reciprocal(–  SquareOroot(–  Invert(squareOroot(

•  Implementa%on(–  Mul%plica%ve(method(–  29O30(bit(approxima%on(lookOup(

–  Single(itera%on(of(modified(

Goldschmidt(

4/12/13( 24(

Look-Up Tables

1/Sqrt(X)

Look-Up Tables

1/X

Squaring

CS2D

FusedAccumulation Tree

CS2D

V=1-RW

CPA

G=RH

Z=G+GV

Look-Up Tables

1/Sqrt(X)

Look-Up Tables

1/X

Squaring

CS2D

FusedAccumulation Tree

CS2DCPA

MAC

Mac Input Select Logic

X1 X2 X1 X2

Ct0

Y X

Ct0

Ct1

Ct1

Ct0

Ct0Ct1

Y X

X

Ct0

Multiplier

AlignmentShift

Expcomparison

Exp Adder

Accumulator

NormalizationCorrection

Max Exp Max Mantissa

ComparatorLogic

EA

Exp Control Shift

Sign Inversion

EB MA MBEC MC

(b) (c)(a)

Merged(SFU(with(MAC(

4/12/13( 25(

Look-Up Tables

1/Sqrt(X)

Look-Up Tables

1/X

Squaring

CS2D

FusedAccumulation Tree

CS2D

V=1-RW

CPA

G=RH

Z=G+GV

Look-Up Tables

1/Sqrt(X)

Look-Up Tables

1/X

Squaring

CS2D

FusedAccumulation Tree

CS2DCPA

MAC

Mac Input Select Logic

X1 X2 X1 X2

Ct0

Y X

Ct0

Ct1

Ct1

Ct0

Ct0Ct1

Y X

X

Ct0

Multiplier

AlignmentShift

Expcomparison

Exp Adder

Accumulator

NormalizationCorrection

Max Exp Max Mantissa

ComparatorLogic

EA

Exp Control Shift

Sign Inversion

EB MA MBEC MC

(b) (c)(a)

•  For(Linear(Algebra(–  No(need(for(high(throughput(–  No(resource(conflict(

•  U%lize(the(Exis%ng(MAC(Unit(

•  Save(Area(and(Power(•  Augment(Diagonal(PEs(

–  Avoid(extra(communica%ons(

–  Simpler(control(

Extensions(to(the(Base(MAC(in([Jain’10](

•  LU(with(Pivo%ng(–  Keep(the(maximum(value(produced(

so(far(•  Reduce(the(search(space(for(pivot(•  Only(nr(values(to(compare(among(PEs(

–  Simple(logic(for(comparator(•  Ac%vated(with(normaliza%on(

•  Not(on(the(cri%cal(path(•  Causes(no(extra(delay(

•  VectorOnorm(–  An(extra(exponent(bit((gray)(–  Result(feeds(SFU(–  SFU(produces(typical(FP(value(

4/12/13( 26(

Look-Up Tables

1/Sqrt(X)

Look-Up Tables

1/X

Squaring

CS2D

FusedAccumulation Tree

CS2D

V=1-RW

CPA

G=RH

Z=G+GV

Look-Up Tables

1/Sqrt(X)

Look-Up Tables

1/X

Squaring

CS2D

FusedAccumulation Tree

CS2DCPA

MAC

Mac Input Select Logic

X1 X2 X1 X2

Ct0

Y X

Ct0

Ct1

Ct1

Ct0

Ct0Ct1

Y X

X

Ct0

Multiplier

AlignmentShift

Expcomparison

Exp Adder

Accumulator

NormalizationCorrection

Max Exp Max Mantissa

ComparatorLogic

EA

Exp Control Shift

Sign Inversion

EB MA MBEC MC

(b) (c)(a)

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results(

•  Conclusion(and(Future(Work(

4/12/13( 27(

Power(and(Performance(Analysis(

•  Component(Selec%ons(

–  @(1GHz(target(frequency(

–  MAC(units((45nm)([Galal’10](

(

–  Storage(model(with(CACTI(6.0(

•  Pure(SRAM(Model(

(

–  Interconnect(•  CACTI(6.0(•  AMBA(AHB([Lahiri’04](

•  [Wolkore.2009](

(4/12/13( 28(

•  Two(Case(for(MAC(Extension(–  Only(maximum(finder(

–  Extended(exponent(bit(

•  Three(Cases(of(SFU(Support(–  No(SFU((μOprogrammed(in(SW)(

–  Isolated(separate(SFU(–  Diagonal(PEs((

•  Merged(unit(

•  Simpler(control(

•  Unblocked(Inner(Kernels((–  nr×nr(Cholesky((–  k×nr((LU(and(QR((

•  k(=(64,(128,(256((

(

(

(

LU(with(Par%al(Pivo%ng(

•  Inner(Kernel(Size(k×nr:((–  k(=(64,(128,(256(

•  No(Benefit(From(SFU(–  Reciprocal(and(pivo%ng(

performed(concurrently(

•  Pivo%ng(is(Domina%ng(

•  (Benefits(of(Comparator((–  (20%(speed(up((–  15%(energy(savings(

4/12/13( 29(

0"

5"

10"

15"

20"

25"

30"

35"

SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"

GOPS/W

"

Three"types"of"sqrt/division"units"with"kernel"heights"64,"128,"256"

LU"No"Ext"

LU+"Comparator"

0"

0.5"

1"

1.5"

2"

2.5"

3"

SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"

GOPS/m

m^2"

Three"types"of"sqrt/division"unit"with"kernel"heights"64,"128,"256"

LU"No"Ext"

LU+"Comparator"

Vector(Norm(

•  Inner(Kernel(Size(k×nr:((–  k(=(64,(128,(256(

•  SFU(–  30%(speedup(–  10%(energy(savings(

•  Exponent(Bit(–  100%(speedup((–  60%(energy(savings(

4/12/13( 30(

0"

20"

40"

60"

80"

100"

120"

SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"

GFLO

PS/W

"

Three"types"of"sqrt/division"units""with"kernel"heights"64,"128,"256"

Vnorm"No"Ext"

Vnorm+"Comparator"Vnorm+"Exp"Ext"

0"

2"

4"

6"

8"

10"

12"

14"

SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"

GFLOPS/m

m^2"

Three"types"of"sqrt/division"units""with"kernel"heights"64,"128,"256"

Vnorm"No"Ext"

Vnorm+"Comparator"

Vnorm+"Exp"Ext"

Outline(

•  Mo%va%on(and(Vision(

•  Related(Work(and(Background(

•  Base(Architecture(•  Factoriza%on(Algorithms(

•  Floa%ng(Point(Unit(Extensions(•  Experimental(Results(

•  Conclusion(and(Future(Work(

(4/12/13( 31(

Conclusions(and(Future(Work((

•  Matrix(Factoriza%ons(–  Cholesky(–  LU(with(Par%al(Pivo%ng(–  Householder(QR(

•  FPMAC(Modifica%ons((–  Decrease/Facilitate(algorithm(

complexity(

•  Extension(–  Inverse(SquareOroot(–  Reciprocal(–  Extended(exponent(–  Maximum(finder(

!  Integra%on(of(the(Core(Into(a(Heterogeneous(System(

(

(

!  Block(Level(Algorithmic(Analyses(

(

(

!  Compare(TradeOoffs((–  Efficiency(vs.(Flexibility(

4/12/13( 32(

Thank(You(and(Welcome(to(Texas(

4/12/13( 33(

FPMAC[Vangal(et(al.2006](

4/12/13( 35(!

Figure 2. The reconfigurable Fused/Continuous MAC architecture and pipeline stage

"#$%$! %&'()*%! )+$! +$,&-$.! ,/! 0+$1&/2%! 0&0$! %,)'$! 3456! ,/!

+$.27$!,#$!7+&,&7)*!0),#!.$*)89!"#$!:4;!/<!,#$!)772-2*),/+!

/2,02,!+$0+$%$(,%!%&'(!/2,02,!/<!,#$!:=>!/0$+),&/(!)(.!&%!

2%$.! ,/! 7/(1$+,! ,#$! )772-2*),$.! -)(,&%%)! <+/-! ?@%!

7/-0*$-$(,! </+-!,/!%&'(!-)'(&,2.$! </+-9!"#$!(2-A$+!/<!

*$).&('! B$+/$%! 0+$%$(,! &(! ,#$! <&()*!-)(,&%%)! &%! .$,$+-&($.!

A8!,#$!C$).&('!D$+/!>/2(,$+!3CD>6!&(!0&0$!%,)'$!4EF!G#&7#!

7/2(,%! ,#$! B$+/$%! <+/-! :4;! ./G(! ,/! ,#$! <&+%,! /A%$+1$.!

,+)(%&,&/(!H!!!I! &(! ,#$! <&()*!-)(,&%%)9!;/,#!(/+-)*&B),&/(!

)(.!A)%$?!7/(1$+%&/(!)+$!)77/-0*&%#$.! &(! <&()*!0&0$!%,)'$!

4J9!!

!"#! $%&'()*+,-./0+1+234(*425%46784"#$! .$%&'($.! :=>! )+7#&,$7,2+$! 7)(! A$! +$7/(<&'2+$.! ,/!

0$+</+-! -2*,&0*$! /0$+),&/(%! /(! A/,#! <*/),&('K0/&(,! )(.!

&(,$'$+!/0$+)(.%9!L&<<$+$(,!-/.$%!/<!/0$+),&/(!)+$M!

3I6! N2%$.O>/(,&(2/2%! :=>! /0$+),&/(M! =! 7/(,+/*! %&'()*F!

!"#$%&'(%)*)%! %#/G(! &(!N&'2+$!?* &%!2%$.! ,/!7/(<&'2+$! ,#$!

:=>! ,/! 0$+</+-! $&,#$+! <2%$.! /+! 7/(,&(2/2%! :=>!

/0$+),&/(9!!

3?6! N*/),&('K0/&(,OP(,$'$+! :=>! /0$+),&/(M! =! 7/(,+/*!

%&'()*F! +,$"'-.'! 7/(<&'2+$%! ,#$!:=>! $('&($! ,/! /0$+),$! /(!

$&,#$+! %&('*$K0+$7&%&/(! <*/),&('K0/&(,! /+! IEKA&,! &(,$'$+!

/0$+)(.%9! P(!7)%$!/<! <*/),&('K0/&(,!:=>!/0$+),&/(F!)**! ,#$!

&(02,!Q?!A&,%!/<!%&('*$K0+$7&%&/(!<*/),&('K0/&(,!/0$+)(.%!)+$!

0)%%$.! )%! &(02,%9! L2+&('! ,#$! &(,$'$+! :=>! /0$+),&/(F! ,#$!

IEKA&,%!/<!&(,$'$+!&(02,%!)+$!%$*$7,&1$*8!0)%%$.!,#+/2'#!,#$!

&(02,! A&,%! )%! %#/G(! &(! N&'2+$! R9! "/! )77/-0*&%#! &(,$'$+!

-2*,&0*&7),&/(!G&,#/2,!)(8!0+$7&%&/(!*/%%F! ,#$!IRA&,! &(,$'$+!

&(02,%! =! )(.! ;! /<! ,#$! -2*,&0*&$+! )+$! )00*&$.! ,#+/2'#! ,#$!

:4;!IRA&,%!/<!,#$!-)(,&%%)%!=!)(.!;F!+$%0$7,&1$*89!!

4Figure 3. Accumulator mantissa datapath

!!!!!!!!!!!Figure 4. Accumulator exponent datapath

!"#

Authorized licensed use limited to: University of Texas at Austin. Downloaded on April 15,2010 at 21:26:16 UTC from IEEE Xplore. Restrictions apply.