University of Michigan Electrical Engineering and Computer Science Data-centric Subgraph Mapping for Narrow Computation Accelerators Amir Hormati, Nathan Clark, and Scott Mahlke Advanced Computer Architecture Lab. University of Michigan


Page 1: Data-centric Subgraph Mapping for Narrow Computation Accelerators

University of Michigan, Electrical Engineering and Computer Science

Data-centric Subgraph Mapping for Narrow Computation Accelerators

Amir Hormati, Nathan Clark, and Scott Mahlke

Advanced Computer Architecture Lab.

University of Michigan

Page 2: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Introduction

• Migration of applications

• Programmability and cost issues in ASIC

• More functionality in the embedded processor

Page 3: Data-centric Subgraph Mapping for Narrow Computation Accelerators


What Are the Challenges

• Accelerator Hardware:

• Compiler Algorithm:

Page 4: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Configurable Compute Array (CCA)

• Array of FUs

• Arithmetic/logic

• 32-bit functional units

• Full interconnect between rows

• Supports 95% of all computation patterns

(Nathan Clark, ISCA 2005)

[Figure: CCA block diagram with four inputs (Input1 to Input4) and two outputs (Output1, Output2).]

Page 5: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Report Card on the Original CCA

• Easy to integrate into current embedded systems

• High performance gain

however...

• 32-bit general purpose CCA:
  – 130nm standard cell library
  – Area requirement: 0.3 mm2
  – Latency: 3.3 ns

[Figure: die photo of a processor with CCA.]

Page 6: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Objectives of this Work

• Redesign of the CCA hardware
  – Area
  – Latency

• Compilation strategy
  – Code quality
  – Runtime

Page 7: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Width Utilization

• Full width of the FUs is not always needed.

• Narrower FUs alone are not the solution.

Benchmark    Less than 16-bit    Less than 8-bit
Rawcaudio    94%                 52%
Rawdaudio    91%                 60%
Epic         80%                 45%
Unepic       74%                 40%
Cjpeg        76%                 49%
Djpeg        70%                 53%

Benchmark    Larger than 16-bit  Larger than 8-bit
3des         86%                 90%
bitcount     80%                 85%
rijndael     50%                 64%
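A minimal sketch of how width statistics like those in the tables could be gathered from a trace of operand values. The function names, the sample trace, and the choice of "fits in at most 8 or 16 bits" as the threshold are assumptions for illustration, not details taken from the slides.

```python
def bit_width(value: int) -> int:
    """Bits needed to represent a non-negative value (at least 1)."""
    return max(1, value.bit_length())


def width_profile(values):
    """Return the fractions of values that fit in <= 8 bits and <= 16 bits."""
    total = len(values)
    fits_8 = sum(1 for v in values if bit_width(v) <= 8)
    fits_16 = sum(1 for v in values if bit_width(v) <= 16)
    return fits_8 / total, fits_16 / total


# Hypothetical operand trace: 1 bit, 8 bits, 9 bits, 17 bits.
trace = [1, 255, 256, 70000]
narrow_8, narrow_16 = width_profile(trace)
print(f"<= 8 bits: {narrow_8:.0%}, <= 16 bits: {narrow_16:.0%}")
```

A profile of this kind is what makes the case on the slide: many values are narrow, but not all of them, so the hardware still has to handle the wide ones.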

Page 8: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Width-Aware Narrow CCA

[Figure: width-aware narrow CCA datapath. Labeled blocks: Input Registers, Width Checker, Iteration Controller (Iterate signal), CCA, Carry Bits, and Output Registers (Output 1, Output 2). Operand bits are split into [0-7] and [8-31].]

Page 9: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Sparse Interconnect

• Rank wires based on utilization.

• >50% wires removed.

• 91% of all patterns are supported.

[Figure: two CCA diagrams, each with four inputs (Input1 to Input4) and two outputs (Output1, Output2), comparing the full and the sparse interconnect.]
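The wire-ranking idea can be sketched as follows. This is an illustration under stated assumptions: patterns are modeled as lists of hypothetical wire names, utilization is a simple occurrence count, and the `keep_fraction` parameter is invented here; the slides do not say how utilization was measured.

```python
from collections import Counter


def wire_usage(patterns):
    """Count how many patterns use each wire."""
    return Counter(wire for wires in patterns for wire in wires)


def prune_wires(usage, keep_fraction=0.5):
    """Rank wires by utilization and keep the most used fraction."""
    ranked = sorted(usage, key=lambda w: usage[w], reverse=True)
    keep_count = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:keep_count])


def supported_fraction(patterns, kept):
    """Fraction of patterns whose wires all survive the pruning."""
    supported = sum(1 for wires in patterns if set(wires) <= kept)
    return supported / len(patterns)


# Hypothetical patterns, each listed by the wires it needs.
patterns = [["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"], ["d"]]
kept = prune_wires(wire_usage(patterns))
print(sorted(kept), supported_fraction(patterns, kept))
```

The trade-off is the one on the slide: dropping rarely used wires costs only the few patterns that depend on them.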

Page 10: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Synthesis Results

Accelerator Configuration          Latency (ns)   Area (mm2)
32-bit with full interconnect      3.30           0.301
32-bit with sparse interconnect    2.95           0.270
16-bit with full interconnect      2.88           0.168
16-bit with sparse interconnect    2.55           0.140
8-bit with full interconnect       2.56           0.080
8-bit with sparse interconnect     2.00           0.070
Width Checker                      0.39           0.002

• Synthesized using Synopsys and Encounter in 130nm library.
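As a quick check on the table, the ratios against the 32-bit full-interconnect baseline can be computed directly. The numbers are copied from the table; the dictionary layout and function name are illustrative only, and the width checker is left out of the comparison.

```python
# (latency in ns, area in mm^2), copied from the synthesis table.
CONFIGS = {
    "32-bit full": (3.30, 0.301),
    "32-bit sparse": (2.95, 0.270),
    "16-bit full": (2.88, 0.168),
    "16-bit sparse": (2.55, 0.140),
    "8-bit full": (2.56, 0.080),
    "8-bit sparse": (2.00, 0.070),
}


def relative_to_baseline(name, baseline="32-bit full"):
    """Return (latency ratio, area ratio) of the baseline over `name`."""
    base_latency, base_area = CONFIGS[baseline]
    latency, area = CONFIGS[name]
    return base_latency / latency, base_area / area


for name in CONFIGS:
    latency_ratio, area_ratio = relative_to_baseline(name)
    print(f"{name:14s} latency ratio {latency_ratio:.2f}, area ratio {area_ratio:.2f}")
```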

Page 11: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Compilation Challenges

• Best portions of the code

• Non-uniform latency

• What are the current solutions:
  – Hand coding
  – Function intrinsics
  – Greedy solution

Page 12: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Step 1: Enumeration

[Figure: example dataflow graph of eight numbered operations (ADD, AND, OR, XOR, CMP) between live-in and live-out values, shown next to several candidate subgraphs enumerated from it.]
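A brute-force sketch of the enumeration step: list every connected subgraph of a small dataflow graph up to a size limit. The graph below is hypothetical (the edges of the slide's example are not given), and real enumerators typically also check input/output port limits and convexity, which this sketch omits.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """True if `nodes` form one connected component, ignoring edge direction."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for src, dst in edges:
            if src == current and dst in nodes and dst not in seen:
                seen.add(dst)
                stack.append(dst)
            elif dst == current and src in nodes and src not in seen:
                seen.add(src)
                stack.append(src)
    return seen == nodes


def enumerate_subgraphs(ops, edges, max_size=4):
    """Return all connected subsets of `ops` with at most `max_size` nodes."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(ops), size):
            if is_connected(combo, edges):
                found.append(frozenset(combo))
    return found


# Hypothetical graph: ops 1 and 2 feed op 3, which feeds op 4.
ops = {1, 2, 3, 4}
edges = [(1, 3), (2, 3), (3, 4)]
candidates = enumerate_subgraphs(ops, edges)
print(len(candidates))
```

The search is exponential in the block size, which is why a later slide mentions shrinking the search space to control runtime.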

Page 13: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Step 2: Subgraph Isomorphism Pruning

• Ensure subgraphs can run on accelerator

[Figure: step-by-step mapping of a candidate subgraph (3 AND, 6 SUB, 8 SHL, 10 SHRA, 11 ADD) onto an accelerator with FUs A to H, where A = <<, B = *, C = Logic, D = >>, E = >>, F = +/-, G = +/-, H = +/-. Final assignment: op 8 in A, op 3 in C, op 10 in D, op 6 in F, op 11 in G.]
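The mapping shown on the slide can be approximated by a small backtracking search that assigns each operation to a free FU of a compatible type. This is a simplification: it checks only opcode compatibility, while a full subgraph isomorphism test would also check that the accelerator's interconnect provides every edge of the subgraph. The slot types are read from the slide; the class names and function are hypothetical.

```python
# FU label -> operation class it supports, as laid out on the slide.
SLOTS = {
    "A": "shl", "B": "mul", "C": "logic",
    "D": "shr", "E": "shr", "F": "addsub",
    "G": "addsub", "H": "addsub",
}

# Opcode -> operation class (an assumed grouping for this sketch).
OP_CLASS = {
    "SHL": "shl", "SHRA": "shr", "MUL": "mul",
    "AND": "logic", "OR": "logic", "XOR": "logic",
    "ADD": "addsub", "SUB": "addsub",
}


def assign(ops, slots=SLOTS):
    """Map op id -> FU label by backtracking, or return None if impossible."""
    order = sorted(ops)
    used, result = set(), {}

    def place(index):
        if index == len(order):
            return True
        op_id = order[index]
        for slot, slot_class in slots.items():
            if slot not in used and slot_class == OP_CLASS[ops[op_id]]:
                used.add(slot)
                result[op_id] = slot
                if place(index + 1):
                    return True
                used.remove(slot)  # undo and try the next slot
                del result[op_id]
        return False

    return dict(result) if place(0) else None


subgraph = {3: "AND", 6: "SUB", 8: "SHL", 10: "SHRA", 11: "ADD"}
print(assign(subgraph))
```

A subgraph with two logic operations returns `None` here, since the sketched accelerator has a single Logic FU; such candidates would be pruned.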

Page 14: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Step 3: Grouping

[Figure: the example dataflow graph (operations 1 to 8 between live-in and live-out values) with candidate subgraphs A to F marked, shown before and after grouping; the second view adds the grouped candidate AC.]

• Assuming A and C are the only possibilities for grouping.
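One way to picture grouping is a single pass that merges pairs of disjoint candidates when the combined subgraph still fits the accelerator's ports. This sketch assumes limits of four inputs and two outputs (matching the Input1 to Input4 and Output1, Output2 labels on the CCA diagram), assumes disjoint candidates share no inputs, and uses hypothetical candidates; the slides describe the real step as iterative.

```python
from itertools import combinations


def group_pairs(candidates, max_inputs=4, max_outputs=2):
    """Merge pairs of non-overlapping candidates that fit the port limits."""
    grouped = {}
    for (name_a, a), (name_b, b) in combinations(sorted(candidates.items()), 2):
        if a["ops"] & b["ops"]:
            continue  # overlapping candidates cannot run together
        inputs = a["ins"] + b["ins"]
        outputs = a["outs"] + b["outs"]
        if inputs <= max_inputs and outputs <= max_outputs:
            grouped[name_a + name_b] = {
                "ops": a["ops"] | b["ops"],
                "ins": inputs,
                "outs": outputs,
            }
    return grouped


# Hypothetical candidates; op membership and port counts are made up.
candidates = {
    "A": {"ops": frozenset({3, 4}), "ins": 2, "outs": 1},
    "B": {"ops": frozenset({3, 5}), "ins": 3, "outs": 1},
    "C": {"ops": frozenset({5, 6}), "ins": 2, "outs": 1},
}
grouped = group_pairs(candidates)
print(list(grouped))
```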

Page 15: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Dealing with Non-uniform Latency

[Figure: example subgraph made of OR, ADD, and AND operations.]

Op     W[0,8]   W[9,16]   W[17,24]   W[25,32]   Average Latency
ADD    100%     0%        0%         0%         1
OR     0%       50%       0%         50%        3
AND    0%       50%       50%        0%         2.5

Subgraph Cost: 3    Benefit: 0

[Figure: timelines for three cases A, B, and C, each mixing 8-bit and 24-bit executions over time; each has average latency = 2.]

• >94% do not change width
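The average latencies in the table follow from weighting the number of 8-bit iterations by the width profile. The sketch below reproduces them; treating the subgraph cost as the largest per-operation average, and the benefit as the number of single-cycle operations replaced minus that cost, are assumptions that happen to match the slide's "Cost: 3, Benefit: 0".

```python
WIDTH_UPPER_BOUNDS = [8, 16, 24, 32]  # upper end of each W[lo,hi] bucket


def iterations(width_bits, slice_bits=8):
    """Number of 8-bit passes needed for an operand of the given width."""
    return -(-width_bits // slice_bits)  # ceiling division


def average_latency(profile):
    """Expected number of iterations given the probability of each bucket."""
    return sum(p * iterations(w) for p, w in zip(profile, WIDTH_UPPER_BOUNDS))


# Width profiles copied from the table.
profiles = {
    "ADD": [1.0, 0.0, 0.0, 0.0],
    "OR": [0.0, 0.5, 0.0, 0.5],
    "AND": [0.0, 0.5, 0.5, 0.0],
}

latencies = {op: average_latency(p) for op, p in profiles.items()}
subgraph_cost = max(latencies.values())       # assumed: slowest op dominates
benefit = len(profiles) - subgraph_cost       # assumed: ops replaced minus cost
print(latencies, subgraph_cost, benefit)
```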

Page 16: Data-centric Subgraph Mapping for Narrow Computation Accelerators


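Step 4: Unate Covering

Op ID   Width   Candidates covering it (count)
1       24      3
2       8       3
3       24      4
4       8       3
5       32      2
6       32      2
7       8       2
8       8       3 (including N)

Candidate   A    B    C    AC   D    E    F    G    H    …    N
Cost        3    4    3    3    1    4    4    1    1    …    1
Benefit     -1   -1   -1   1    1    -1   -1   0    0    …    0

[Table: covering matrix with one row per operation (Op ID 1 to 8) and one column per candidate subgraph (A, B, C, AC, D, E, F, G, H, …, N); a 1 marks an operation covered by that candidate.]

[Table: reduced covering matrix with AC selected (cost 3, benefit 1); the remaining columns D, G, H, …, N have cost 1, 1, 1, …, 1 and benefit 1, 0, 0, …, 0.]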
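For a small instance, the selection step can be sketched as an exhaustive search for the cheapest set of candidates that covers every operation. The costs below are taken from the slide, but which operations each candidate covers is hypothetical, and practical unate covering solvers use reduction rules and branch and bound rather than trying every subset.

```python
from itertools import combinations


def min_cost_cover(ops, candidates):
    """Return (cost, names) of the cheapest candidate set covering all ops.

    `candidates` maps a name to (covered ops, cost). Exhaustive search,
    so only suitable for a handful of candidates.
    """
    names = sorted(candidates)
    best = None
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(candidates[n][0] for n in combo))
            if covered >= ops:
                cost = sum(candidates[n][1] for n in combo)
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best


# Costs from the slide; the covered op sets are assumptions.
candidates = {
    "A": ({3, 4}, 3),
    "C": ({5, 6}, 3),
    "AC": ({3, 4, 5, 6}, 3),
    "D": ({1, 2}, 1),
    "G": ({7}, 1),
    "H": ({8}, 1),
}
print(min_cost_cover(set(range(1, 9)), candidates))
```

With these inputs the grouped candidate AC wins over choosing A and C separately, because it covers the same operations at half the cost.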

Page 17: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Experimental Evaluation

• ARM port of Trimaran compiler system

• Processor model
  – ARM-926EJS
  – Single issue, in-order execution, 5-stage pipeline
  – I/D caches: 16k, 64-way

• Hardware simulation: SimpleScalar 4.0

Page 18: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Comparison of Different CCAs

[Figure: bar chart of percent speedup (0 to 90) per benchmark for the 32-bit CCA, 16-bit CCA, and 8-bit CCA.]

16-bit and 8-bit CCAs are 7% and 9% better than 32-bit CCA.

• Assuming a clock speed of 1/(3.3 ns) ≈ 300 MHz

Page 19: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Comparison of Different Algorithms

[Figure: bar chart of percent speedup (0 to 35) for the Data-centric and Data-unaware algorithms. Benchmarks: md5, blowfish, 3des, sha, sobel, rc4, cjpeg, djpeg, epic, unepic, g721decode, g721encode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, rasta, rijndael, dijkstra_large, susan, rls, LU, bitcount, Average.]

• Previous work: the greedy approach is 10% worse than the data-unaware algorithm.

Page 20: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Conclusion

• Programmable hardware accelerator

• Width-aware CCA: optimizes for the common case.
  – 64% faster clock
  – 4.2x smaller

• Data-centric compilation: deals with the non-uniform latency of the CCA.
  – Average 6.5%, max 12% better than the data-unaware algorithm.

Page 21: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Questions?

For more information: http://cccp.eecs.umich.edu/

Page 22: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Total Runtime

[Figure: runtime in seconds (log scale, 0.01 to 10000) versus block size (0 to 300) for the Data-Centric and FEU series; annotated with 89%, 96%, and 99%.]

Page 23: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Operation of Narrow CCA

[Figure: step-by-step evaluation on a narrow CCA built from three FUs (ADD, OR, ADD), with inputs A = 0x1D, B = 0x0C, C = 0x20, D = 0x08.]

[(0x1D + 0x0C) + (0x20 OR 0x08)]
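The example expression can be traced with a small software model that processes operands one 8-bit slice at a time and carries between iterations. This is a sketch of the idea suggested by the diagram labels (width checker, carry bits, iterate), not a description of the actual hardware; the function name and the early-exit rule are assumptions.

```python
def narrow_eval(a, b, c, d, slice_bits=8, total_bits=32):
    """Evaluate (a + b) + (c | d) slice by slice; return (result, iterations)."""
    mask = (1 << slice_bits) - 1
    # Width check: how many slices the widest input actually occupies.
    widest = max(a, b, c, d).bit_length()
    needed = max(1, -(-widest // slice_bits))
    carry_first = carry_second = 0  # carries of the two adders
    result = 0
    iterations = 0
    for index in range(total_bits // slice_bits):
        if index >= needed and carry_first == 0 and carry_second == 0:
            break  # remaining slices are all zero
        shift = index * slice_bits
        sa, sb, sc, sd = ((x >> shift) & mask for x in (a, b, c, d))
        first = sa + sb + carry_first
        carry_first, first = first >> slice_bits, first & mask
        second = first + (sc | sd) + carry_second
        carry_second, second = second >> slice_bits, second & mask
        result |= second << shift
        iterations += 1
    return result, iterations


value, passes = narrow_eval(0x1D, 0x0C, 0x20, 0x08)
print(hex(value), passes)
```

For the slide's inputs every operand fits in 8 bits and no carry leaves the first slice, so a single pass suffices; wider operands or a carry out of the low slice would trigger further passes.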

Page 24: Data-centric Subgraph Mapping for Narrow Computation Accelerators


Data-Centric Subgraph Mapping

• Enumerate
  – All subgraphs

• Pruning
  – Subgraph isomorphism

• Grouping
  – Iteratively group disconnected subgraphs

• Selection
  – Unate covering

• Shrink search space to control runtime

[Figure: flow of the four phases: Enumeration, Pruning, Grouping, Selection.]

Page 25: Data-centric Subgraph Mapping for Narrow Computation Accelerators


How Good is the Cost Function

[Figure: bar chart of normalized width variance (y-axis 0.75 to 1.00) per benchmark: md5, blowfish, 3des, sha, sobel, rc4, cjpeg, djpeg, epic, unepic, g721decode, g721encode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, rasta, rijndael, dijkstra_large, susan, rls, LU, bitcount, Average.]

Almost all of the operands stay in the same width range throughout the execution.


