Upload
phamkhanh
View
224
Download
3
Embed Size (px)
Citation preview
Floa%ng(Point(Architecture(Extensions(
for(Op%mized(Matrix(Factoriza%on(
(
Ardavan(Pedram(
Robert(van(de(Geijn(
Andreas(Gerstlauer(
(P
OPTIMMUS
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results(
• Conclusion(and(Future(Work(
4/12/13( 2(
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results((
• Conclusion(and(Future(Work(
4/12/13( 3(
Floa%ng(Point(and(Numerical(Stability(
• Floa%ngOpoint(Representa%on(is(Limited((
• Numerical(Algorithms(on(CPUs(
– Ensure(numerical(stability(
– Avoid(overflow(and(underflow(– Introduce(extra(complexity(on(the(
cri%cal(path(of(algorithm(
– Sacrifice(performance(for(numerical(
stability(
• Extra(opera%ons(
4/12/13( 4(
Codesign(is(a(Solu%on(
• Our(Goal((Three(matrix(factoriza%ons()(
– Provide(hardware(support(– Reduce(complexity(in(the(algorithm(
(
• Solu%on(– Provide(extensions(to(the(floa%ngOpoint(unit(– Simplify(the(algorithm(exploi%ng(extended(
architecture(
4/12/13( 5(
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results(
• Conclusion(and(Future(Work(
4/12/13( 6(
Related(work(
• Hybrid(Solu%ons(– [Volkov’08](((
• Cholesky,(LU,(and(QR(– [Agullo’11]((
• Mul%Ocore(Mul%OGPU(
• QR(
• En%rely(on(GPU(– [Kerr’09]((
• Householder(QR(– [Galoppo’05]((
• LU(
• FPGAs(– [Wu’09]((
• LU(without(pivo%ng(– LAPACKrc([Gonzalez’09](
• Customized(solvers(
– [Tai’11](• Scalable(QR(
– [Aslan’09](• Householder(QR(• Unified(unit(
4/12/13( 7(
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results(
• Conclusion(and(Future(Work(
4/12/13( 8(
The(Linear(Algebra(Core((LAC)([Pedram’11](
• Scalable(2OD(Array(of(nr×nr(Processing(Elements((PEs)(
– Up(to(50(GFLOPS/W(efficiency(@(1GHz(
– Specialized(floa%ngOpoint(units(with(1(MAC/cycle(throughput([Vangal’06](
– Broadcast(buses((no(need(to(pipeline(up(to(nr=16)(– Distributed(memory(architecture(
– Distributed,(PEOlocal(microOprogrammed(control(4/12/13( 9(
PE(0,0)
PE(0,1)
PE(0,2)
PE(0,3)
PE(1,0)
PE(1,1)
PE(1,2)
PE(1,3)
PE(2,0)
PE(2,1)
PE(2,2)
PE(2,3)
PE(3,0)
PE(3,1)
PE(3,2)
PE(3,3)
`
MEM B
Address Regs
Row Bus Write
Column Bus Write
A B
µ programmed Controller
Column Bus Read
Row Bus Read
MACAccumulator
Cin
Memory Interface
RFMEM A
`
MEM B
Address Regs
Column Bus Write
A B
µ programmed Controller
MACAccumulator
Cin
RFMEM A
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results((
• Conclusion(and(Future(Work(
4/12/13( 10(
Matrix(Factoriza%ons(
• LU(Factoriza%on(with(Par%al(Pivo%ng(– Gaussian(elimina%on((
– Par%al(pivo%ng(due(to(finite(precision(of(floa%ngOpoint(arithme%c(• avoid(element(growth((
• avoid(catastrophic(cancela%on(• avoid(divide(by(zero((
• Cholesky(Factoriza%on(– Special(case(of(LU(for(Symmetric(Posi%ve(Defini%ve((SPD)(Matrices(
– Half(the(FLOPS(of(LU(
• QR(Factoriza%on((Householder(Reflec%ons)(– Most(stable(solu%on(
– Twice(the(FLOPS(of(LU((
4/12/13( 11(
LU(with(Par%al(Pivo%ng((
• A = LU – Square A�Rn×n – Unit triangular L�Rn×n – Upper triangular U�Rn×n
• Solution by recursively partitioning
• Rearrange((pivot)(the(rows(of(the(matrix(as(the(computa%on(
unfolds((
(4/12/13( 12(
LU Factorization
Update Source 1
GEMMUpdate
TRSM Update
Computed L
Not Yet Computed
A22
A00
A11
A20
A02
A12A10
A01
A21
(1) A11:=A11-A10A01 (2) A21:=A21-A20A01(3)
A11
(6) A12:=A12TRIL(A11)-1
L11
A21
A11Update
Source 2 A10
A20 A21
A01
A11
A12
A11
A12:=LU, p1
A22
A00
A20
A02
A12A10
A01
A21 A22
A00
A20
A02
A12A10
A01
A21
L11
U11
U00
(5) A12:=A12-A10A02
Interchange Rows
A11
A12
A11
A12:=P(p1)
A11
A12
A11
A12(4)
A00 A02
Computed U
A00
A22
A12
p
p0
p1
p2
A01 A02
A20
A10
A21 A22
A10
A20
A00 A02
A22
A12 A11
p
p0
p1
p2
LU(with(Par%al(Pivo%ng((
4/12/13( 13(
A =
✓↵11 aT12a21 A22
◆
L =
✓1 0l21 L22
◆U =
✓�11 uT
12
0 U22
◆
↵11 = �11 aT12 = uT12
a21 = �11l21 A22 � l21uT12 = L22U22
↵11 := �11 (no-op) aT12 := uT12 (no-op)
a21 := l21 = a21/�11 A22 := LU(A22 � l21uT12)
Par%%on(A,(L,&and&U:(
α11(and(ν11(are(scalars(and(A=LU&means:(✓
↵11 aT12a21 A22
◆=
✓�11 uT
12
l21�11 L22U22 + l21uT12
◆
Which in turn means:
We can thus compute L and U in place for matrix A:
LU(with(Par%al(Pivo%ng(on(LAC((1)(
4/12/13( 14(
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find the Pivot Feed 1/x Broadcast Interchange Rows
(1) (4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum produced value in ith
column
Rank-1 update
(2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row
(3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x1/x
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
(S1) (S4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum in ith column Rank-1 update (S2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row (S3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x 1/x
(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)
4/12/13( 15(
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find the Pivot Feed 1/x Broadcast Interchange Rows
(1) (4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum produced value in ith
column
Rank-1 update
(2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row
(3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x1/x
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
(S1) (S4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum in ith column Rank-1 update (S2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row (S3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x 1/x
(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)
LU(with(Par%al(Pivo%ng(on(LAC((2)(
4/12/13( 16(
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find the Pivot Feed 1/x Broadcast Interchange Rows
(1) (4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum produced value in ith
column
Rank-1 update
(2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row
(3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x1/x
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
(S1) (S4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum in ith column Rank-1 update (S2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row (S3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x 1/x
(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)
LU(with(Par%al(Pivo%ng(on(LAC((3)(
4/12/13( 17(
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find the Pivot Feed 1/x Broadcast Interchange Rows
(1) (4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum produced value in ith
column
Rank-1 update
(2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row
(3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x1/x
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
(S1) (S4)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Find maximum in ith column Rank-1 update (S2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Interchange the pivot row with ith row (S3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Scale the ith column with pivot
1/x 1/x
(S1) (S2) (S2) (S3,S4)(S1) (S2) (S2) (S3) (S4)
LU(with(Par%al(Pivo%ng(on(LAC((4)(
4/12/13( 18(
• Pivoting on the Panel to Avoid – Division by zeros on diagonal – Element growth
• Sequential Algorithm – Except for GEMM updates
• Special Demands – Reciprocal (1/X) unit – Find maximum element of a vector
LU(with(Par%al(Pivo%ng(on(LAC(
Cholesky(on(LAC(
4/12/13( 19(
Sub
LAPACK
Sub
BLASPacking
Host OS
LAPHost Processor
BLAS
LAPACK
LA Library
Host Application
Assembly Code Device Driver
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
1/x
(S1)
Broadcast
MultFeed 1/x
Broadcast
Inverse
Broadcast
TRSM Coef
(S2) (S3) (S3)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)(1)
(2)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Broadcast a1 ! Broadcast a1
Broadcast a0T
Rank-1 update C+=a0a0T
(S1)
(S1)
(S1)
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
1/!
(b)
c
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
1/!x
(S1)
Broadcast
MultFeed 1/!x
Broadcast
Inverse Sqrt
(S2) (S3)
• A = LLT – Symmetric Positive Definite (SPD)
matrix A�Rn×n
– Lower triangular matrix L�Rn×n
• Sequential Algorithm – Except for GEMM updates
• Simple Control
– 7-stage state machine in PEs
• Special Demands – Inverse Square-root unit – Square root unit
Householder(QR(
• A = QR (Householder) – Square matrix A�Rn×n – Orthogonal matrix Q�Rn×n – Upper triangular matrix R�Rn×n
• Householder Transformation – Vector-Norm is the key
operation
• Special Demands – Normalization
4/12/13( 20(
x = �0, · · · ,�n�1
y = x/t = 1/t ⇤ (�0, · · · ,�n�1);
kxk2 := t⇥ kyk2
kxk2 =q
�
20 + �
22 + . . .+ �
2n�1
t = max |xi| ;
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results(
• Conclusion(and(Future(Work(
4/12/13( 21(
Summary(of(Extensions(
• Cholesky(– Inverse(squareOroot((– SquareOroot((
(
• LU(with(Par%al(Pivo%ng(– Maximum(finder(
– Reciprocal((
• QR((VectorONorm)(
– Op%on(1:(• Maximum(finder(
• Reciprocal((
– Op%on(2:((• Extend(the(floa%ngOpoint(representa%on(
• Add(one(extra(bit(to(the(exponent(
4/12/13( 22(
Modifica%ons(to(the(LAC(
• Special(Func%on(Unit((SFU)(
– Inverse(squareOroot(– SquareOroot(– Reciprocal((1/X)(
• FP(MAC((
– Maximum(Finder(
– Exponent(Extension(
4/12/13( 23(
L2 Cache
(256 KB)
OOO exec logic
Branch predictor
Fetch/Decode
L1 Instruction Cache
(16 KB)
L1 Data Cache
(16 KB)
L3 Cache
(8 MB)
shared across
cores
Data Path
FU FU !!.. FU
Register
File
Load/
Store
Unit
SIMD ALU
Texture Cache
(read-only)
Fetch/Decode
Shared Local
storage
/
L1 Cache
(64 KB)
L2 Cache
(~1 MB)
Execution Contexts
(128 KB)
ALU1 ALU2 ALU3 ALU4
ALU5 ALU6 ALU7 ALU8
Fetch/Decode
Instruction
Cache
(8 KB)
Data Cache
(4 KB)
Multi-threaded
Instruction Issue
SIMD Poly Execution Unit
PE1
ALU
Mono Execution Unit
PE2 PE96
AL
U
FP
Ad
d
FP
Mu
l
Div
, "
MA
C1
6
Register File
(128 Bytes)
SRAM
(6 KB)
I/O
LAC
PE1 FP MAC
SRAM
(16 KB)
PE2 PE3 PE4
PE5 PE6 PE7 PE8
PE9 PE10 PE11 PE12
PE13 PE14 PE15 PE16
Controller
SFU
One Bank of On-Chip Memory
(512 KB)
Dedicated to Core
On-Chip Memory
(Less than 512 KB)
Shared Between Cores
Fetch/Decode
L1 Cache
(32 KB)
L2 Cache
(512 KB)
SPE1
PowerPC Core
SPE2 SPE8
Register File
(2 MB)
Local Store
(256 KB)
DMA
Fetch/Decode
FU FU Load/Store
UnitFU FU
SIMD ALU
Branch Predictor
SIMD ALU/FPU
Load/Store
Unit
Register File
25 GB/sec
To memory
25 GB/sec
To IO
Interconnect Bus 204 GB/sec
(a) (b)
(b)(a)
(
(
(
(
Special(Func%on(Unit((SFU)([Piñeiro’02](
• Supported(opera%ons(– Division(– Reciprocal(– SquareOroot(– Invert(squareOroot(
• Implementa%on(– Mul%plica%ve(method(– 29O30(bit(approxima%on(lookOup(
– Single(itera%on(of(modified(
Goldschmidt(
4/12/13( 24(
Look-Up Tables
1/Sqrt(X)
Look-Up Tables
1/X
Squaring
CS2D
FusedAccumulation Tree
CS2D
V=1-RW
CPA
G=RH
Z=G+GV
Look-Up Tables
1/Sqrt(X)
Look-Up Tables
1/X
Squaring
CS2D
FusedAccumulation Tree
CS2DCPA
MAC
Mac Input Select Logic
X1 X2 X1 X2
Ct0
Y X
Ct0
Ct1
Ct1
Ct0
Ct0Ct1
Y X
X
Ct0
Multiplier
AlignmentShift
Expcomparison
Exp Adder
Accumulator
NormalizationCorrection
Max Exp Max Mantissa
ComparatorLogic
EA
Exp Control Shift
Sign Inversion
EB MA MBEC MC
(b) (c)(a)
Merged(SFU(with(MAC(
4/12/13( 25(
Look-Up Tables
1/Sqrt(X)
Look-Up Tables
1/X
Squaring
CS2D
FusedAccumulation Tree
CS2D
V=1-RW
CPA
G=RH
Z=G+GV
Look-Up Tables
1/Sqrt(X)
Look-Up Tables
1/X
Squaring
CS2D
FusedAccumulation Tree
CS2DCPA
MAC
Mac Input Select Logic
X1 X2 X1 X2
Ct0
Y X
Ct0
Ct1
Ct1
Ct0
Ct0Ct1
Y X
X
Ct0
Multiplier
AlignmentShift
Expcomparison
Exp Adder
Accumulator
NormalizationCorrection
Max Exp Max Mantissa
ComparatorLogic
EA
Exp Control Shift
Sign Inversion
EB MA MBEC MC
(b) (c)(a)
• For(Linear(Algebra(– No(need(for(high(throughput(– No(resource(conflict(
• U%lize(the(Exis%ng(MAC(Unit(
• Save(Area(and(Power(• Augment(Diagonal(PEs(
– Avoid(extra(communica%ons(
– Simpler(control(
Extensions(to(the(Base(MAC(in([Jain’10](
• LU(with(Pivo%ng(– Keep(the(maximum(value(produced(
so(far(• Reduce(the(search(space(for(pivot(• Only(nr(values(to(compare(among(PEs(
– Simple(logic(for(comparator(• Ac%vated(with(normaliza%on(
• Not(on(the(cri%cal(path(• Causes(no(extra(delay(
• VectorOnorm(– An(extra(exponent(bit((gray)(– Result(feeds(SFU(– SFU(produces(typical(FP(value(
4/12/13( 26(
Look-Up Tables
1/Sqrt(X)
Look-Up Tables
1/X
Squaring
CS2D
FusedAccumulation Tree
CS2D
V=1-RW
CPA
G=RH
Z=G+GV
Look-Up Tables
1/Sqrt(X)
Look-Up Tables
1/X
Squaring
CS2D
FusedAccumulation Tree
CS2DCPA
MAC
Mac Input Select Logic
X1 X2 X1 X2
Ct0
Y X
Ct0
Ct1
Ct1
Ct0
Ct0Ct1
Y X
X
Ct0
Multiplier
AlignmentShift
Expcomparison
Exp Adder
Accumulator
NormalizationCorrection
Max Exp Max Mantissa
ComparatorLogic
EA
Exp Control Shift
Sign Inversion
EB MA MBEC MC
(b) (c)(a)
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results(
• Conclusion(and(Future(Work(
4/12/13( 27(
Power(and(Performance(Analysis(
• Component(Selec%ons(
– @(1GHz(target(frequency(
– MAC(units((45nm)([Galal’10](
(
– Storage(model(with(CACTI(6.0(
• Pure(SRAM(Model(
(
– Interconnect(• CACTI(6.0(• AMBA(AHB([Lahiri’04](
• [Wolkore.2009](
(4/12/13( 28(
• Two(Case(for(MAC(Extension(– Only(maximum(finder(
– Extended(exponent(bit(
• Three(Cases(of(SFU(Support(– No(SFU((μOprogrammed(in(SW)(
– Isolated(separate(SFU(– Diagonal(PEs((
• Merged(unit(
• Simpler(control(
• Unblocked(Inner(Kernels((– nr×nr(Cholesky((– k×nr((LU(and(QR((
• k(=(64,(128,(256((
(
(
(
LU(with(Par%al(Pivo%ng(
• Inner(Kernel(Size(k×nr:((– k(=(64,(128,(256(
• No(Benefit(From(SFU(– Reciprocal(and(pivo%ng(
performed(concurrently(
• Pivo%ng(is(Domina%ng(
• (Benefits(of(Comparator((– (20%(speed(up((– 15%(energy(savings(
4/12/13( 29(
0"
5"
10"
15"
20"
25"
30"
35"
SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"
GOPS/W
"
Three"types"of"sqrt/division"units"with"kernel"heights"64,"128,"256"
LU"No"Ext"
LU+"Comparator"
0"
0.5"
1"
1.5"
2"
2.5"
3"
SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"
GOPS/m
m^2"
Three"types"of"sqrt/division"unit"with"kernel"heights"64,"128,"256"
LU"No"Ext"
LU+"Comparator"
Vector(Norm(
• Inner(Kernel(Size(k×nr:((– k(=(64,(128,(256(
• SFU(– 30%(speedup(– 10%(energy(savings(
• Exponent(Bit(– 100%(speedup((– 60%(energy(savings(
4/12/13( 30(
0"
20"
40"
60"
80"
100"
120"
SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"
GFLO
PS/W
"
Three"types"of"sqrt/division"units""with"kernel"heights"64,"128,"256"
Vnorm"No"Ext"
Vnorm+"Comparator"Vnorm+"Exp"Ext"
0"
2"
4"
6"
8"
10"
12"
14"
SW" Isolate" Diag" SW" Isolate" Diag" SW" Isolate" Diag"
GFLOPS/m
m^2"
Three"types"of"sqrt/division"units""with"kernel"heights"64,"128,"256"
Vnorm"No"Ext"
Vnorm+"Comparator"
Vnorm+"Exp"Ext"
Outline(
• Mo%va%on(and(Vision(
• Related(Work(and(Background(
• Base(Architecture(• Factoriza%on(Algorithms(
• Floa%ng(Point(Unit(Extensions(• Experimental(Results(
• Conclusion(and(Future(Work(
(4/12/13( 31(
Conclusions(and(Future(Work((
• Matrix(Factoriza%ons(– Cholesky(– LU(with(Par%al(Pivo%ng(– Householder(QR(
• FPMAC(Modifica%ons((– Decrease/Facilitate(algorithm(
complexity(
• Extension(– Inverse(SquareOroot(– Reciprocal(– Extended(exponent(– Maximum(finder(
! Integra%on(of(the(Core(Into(a(Heterogeneous(System(
(
(
! Block(Level(Algorithmic(Analyses(
(
(
! Compare(TradeOoffs((– Efficiency(vs.(Flexibility(
4/12/13( 32(
FPMAC[Vangal(et(al.2006](
4/12/13( 35(!
Figure 2. The reconfigurable Fused/Continuous MAC architecture and pipeline stage
"#$%$! %&'()*%! )+$! +$,&-$.! ,/! 0+$1&/2%! 0&0$! %,)'$! 3456! ,/!
+$.27$!,#$!7+&,&7)*!0),#!.$*)89!"#$!:4;!/<!,#$!)772-2*),/+!
/2,02,!+$0+$%$(,%!%&'(!/2,02,!/<!,#$!:=>!/0$+),&/(!)(.!&%!
2%$.! ,/! 7/(1$+,! ,#$! )772-2*),$.! -)(,&%%)! <+/-! ?@%!
7/-0*$-$(,! </+-!,/!%&'(!-)'(&,2.$! </+-9!"#$!(2-A$+!/<!
*$).&('! B$+/$%! 0+$%$(,! &(! ,#$! <&()*!-)(,&%%)! &%! .$,$+-&($.!
A8!,#$!C$).&('!D$+/!>/2(,$+!3CD>6!&(!0&0$!%,)'$!4EF!G#&7#!
7/2(,%! ,#$! B$+/$%! <+/-! :4;! ./G(! ,/! ,#$! <&+%,! /A%$+1$.!
,+)(%&,&/(!H!!!I! &(! ,#$! <&()*!-)(,&%%)9!;/,#!(/+-)*&B),&/(!
)(.!A)%$?!7/(1$+%&/(!)+$!)77/-0*&%#$.! &(! <&()*!0&0$!%,)'$!
4J9!!
!"#! $%&'()*+,-./0+1+234(*425%46784"#$! .$%&'($.! :=>! )+7#&,$7,2+$! 7)(! A$! +$7/(<&'2+$.! ,/!
0$+</+-! -2*,&0*$! /0$+),&/(%! /(! A/,#! <*/),&('K0/&(,! )(.!
&(,$'$+!/0$+)(.%9!L&<<$+$(,!-/.$%!/<!/0$+),&/(!)+$M!
3I6! N2%$.O>/(,&(2/2%! :=>! /0$+),&/(M! =! 7/(,+/*! %&'()*F!
!"#$%&'(%)*)%! %#/G(! &(!N&'2+$!?* &%!2%$.! ,/!7/(<&'2+$! ,#$!
:=>! ,/! 0$+</+-! $&,#$+! <2%$.! /+! 7/(,&(2/2%! :=>!
/0$+),&/(9!!
3?6! N*/),&('K0/&(,OP(,$'$+! :=>! /0$+),&/(M! =! 7/(,+/*!
%&'()*F! +,$"'-.'! 7/(<&'2+$%! ,#$!:=>! $('&($! ,/! /0$+),$! /(!
$&,#$+! %&('*$K0+$7&%&/(! <*/),&('K0/&(,! /+! IEKA&,! &(,$'$+!
/0$+)(.%9! P(!7)%$!/<! <*/),&('K0/&(,!:=>!/0$+),&/(F!)**! ,#$!
&(02,!Q?!A&,%!/<!%&('*$K0+$7&%&/(!<*/),&('K0/&(,!/0$+)(.%!)+$!
0)%%$.! )%! &(02,%9! L2+&('! ,#$! &(,$'$+! :=>! /0$+),&/(F! ,#$!
IEKA&,%!/<!&(,$'$+!&(02,%!)+$!%$*$7,&1$*8!0)%%$.!,#+/2'#!,#$!
&(02,! A&,%! )%! %#/G(! &(! N&'2+$! R9! "/! )77/-0*&%#! &(,$'$+!
-2*,&0*&7),&/(!G&,#/2,!)(8!0+$7&%&/(!*/%%F! ,#$!IRA&,! &(,$'$+!
&(02,%! =! )(.! ;! /<! ,#$! -2*,&0*&$+! )+$! )00*&$.! ,#+/2'#! ,#$!
:4;!IRA&,%!/<!,#$!-)(,&%%)%!=!)(.!;F!+$%0$7,&1$*89!!
4Figure 3. Accumulator mantissa datapath
!!!!!!!!!!!Figure 4. Accumulator exponent datapath
!"#
Authorized licensed use limited to: University of Texas at Austin. Downloaded on April 15,2010 at 21:26:16 UTC from IEEE Xplore. Restrictions apply.