Application Specific Instruction Generation for Configurable Processor Architectures

Page 1: Application Specific Instruction Generation for Configurable Processor Architectures

Application Specific Instruction Generation for Configurable Processor Architectures

VLSI CAD Lab, Computer Science Department, UCLA

Led by Jason Cong

Yiping Fan, Guoling Han, Zhiru Zhang

Supported by NSF

Page 2: Application Specific Instruction Generation for Configurable Processor Architectures

Outline

• Motivation
• Related Works
• Problem Statement
• Proposed Solutions
• Experimental Results
• Conclusions

Page 3: Application Specific Instruction Generation for Configurable Processor Architectures

Motivation

Page 4: Application Specific Instruction Generation for Configurable Processor Architectures

Motivation (cont’d)

• Flexibility is required to satisfy different requirements and to avoid potential design errors
• Application Specific Instruction-set Processors (ASIPs) provide a solution to the tradeoff between efficiency and flexibility
  • A general purpose processor + specific hardware resource
  • Base instruction set + customized instructions
  • Specific hardware resource implements the customized instructions
  • Either runtime reconfigurable or pre-synthesized
• Gaining popularity recently
  • IFX Carmel 20xx, ARM, Tensilica Xtensa, STM Lx, ARC Cores

Page 5: Application Specific Instruction Generation for Configurable Processor Architectures

Application Specific Instruction-set Processor

Program with basic instruction set I

t1 = a * b;

t2 = b * 0xf0;

t3 = c * 0x12;

t4 = t1 + t2;

t5 = t2 + t3;

t6 = t5 + t4;

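[Figure: data-flow graph of the program, with custom logic attached to the processor. Three multiplications over the inputs a, b, 0xf0, 0x12, c feed three additions.]

Execution time: 9 clock cycles

*: 2 clock cycles, +: 1 clock cycle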

Page 6: Application Specific Instruction Generation for Configurable Processor Architectures

Application Specific Instruction-set Processor (cont’d)

[Figure: the data-flow graph of the program, with the two multiply-add subgraphs grouped as extop1 and extop2]

Program with extended instructions

t1 = extop1(a, b, 0xf0);

t2 = extop2(b, c, 0xf0, 0x12);

t3 = t1 + t2;

Execution time: 5 clock cycles

Speedup: 1.8

extops: 2 clock cycles, +: 1 clock cycle

Extended instruction set: I ∪ {extop1, extop2}
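The cycle counts on these two slides can be checked with a few lines of Python. This is only a sketch: it assumes instructions issue one after another with no overlap, and the names `LATENCY`, `basic_program` and `extended_program` are illustrative, not from the slides.

```python
# Latencies stated on the slides: multiply and extended ops take 2 cycles, add takes 1.
LATENCY = {"*": 2, "+": 1, "extop": 2}

# Operation sequence of each program version (hypothetical encoding).
basic_program = ["*", "*", "*", "+", "+", "+"]   # t1 .. t6
extended_program = ["extop", "extop", "+"]       # extop1, extop2, final add


def execution_time(ops):
    """Sum per-instruction latencies, assuming purely sequential issue."""
    return sum(LATENCY[op] for op in ops)


t_basic = execution_time(basic_program)
t_extended = execution_time(extended_program)
speedup = t_basic / t_extended
print(t_basic, t_extended, speedup)
```

Under these assumptions the basic program takes 9 cycles, the extended one 5, and the ratio is the 1.8 speedup quoted on the slide.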

Page 7: Application Specific Instruction Generation for Configurable Processor Architectures

Related Works

• [Kastner et al, TODAES’02]
  • Template generation + covering
  • Limitation: Minimum number of templates may not lead to maximum speedup; architecture constraints are ignored
• [Atasu et al, DAC’03]
  • Branch and bound
  • Limitation: High complexity; instruction reuse is not considered
• [Peymandoust et al, ICASAP’03]
  • Instruction selection + instruction mapping
  • Limitation: Minimizes the number of extended instructions

Page 8: Application Specific Instruction Generation for Configurable Processor Architectures

Preliminaries

• Control data flow graph (CDFG)
  • Basic blocks (BBK): each bbk is a DAG, denoted by G(V, E)
  • Control edges
• Cone
  • A subgraph consisting of node v and its predecessors such that any path connecting a node in the cone and v lies entirely in the cone
  • K-feasible cone: a cone with at most K inputs
• Pattern
  • A single output DAG
  • Trivial pattern
  • Nontrivial pattern
  • Associated with execution time, number of I/O, area

[Figure: example DAG with multiplication nodes n1, n2, n3 over the inputs a, b, 0xf0, 0x12, c, addition nodes n4 and n5, and output addition node n6]

Trivial pattern annotations: execution time; I/O: 2-in 1-out

Nontrivial pattern annotations: SW execution time; HW execution time; I/O: 2-in 1-out; area: 2; inputs {a, b, 0xf0}

Page 9: Application Specific Instruction Generation for Configurable Processor Architectures

Problem Statement

Given:
• G(V, E)
• The basic instruction set I
• Pattern constraints:
  I. Number of inputs: $|PI(p_i)| \le N_{in}, \ \forall i$
  II. Number of outputs: $|PO(p_i)| = 1, \ \forall i$
  III. Total area: $\sum_{i=1}^{N} area(p_i) \le A$

Objective:
• Generate a pattern library P
• Map G to the extended instruction set I ∪ P, so that the total execution time is minimized.

Page 10: Application Specific Instruction Generation for Configurable Processor Architectures

Problem Decomposition

• Sub-problem 1. Pattern Enumeration: Generate all of the patterns S satisfying the constraints (i) and (ii) from G(V, E).
• Sub-problem 2. Instruction Set Selection: Select a subset P of S to maximize the potential speedup while satisfying the area constraint.
• Sub-problem 3. Application Mapping: Map G(V, E) to I ∪ P so that the total execution time of G is minimized.

Page 11: Application Specific Instruction Generation for Configurable Processor Architectures

Proposed ASIP Compilation Flow

[Figure: compilation flow. C → SUIF / CDFG generator → CDFG → Pattern Generation / Pattern Selection (under ASIP constraints) → Pattern library → Application Mapping → Mapped CDFG → Instruction Implementation / ASIP synthesis → Implementation]

Page 12: Application Specific Instruction Generation for Configurable Processor Architectures

1. Pattern Enumeration

• All possible application specific instruction patterns should be enumerated
  • Each pattern is a k-feasible cone
• Cut enumeration is used to enumerate all the k-feasible cones [Cong et al, FPGA’99]
  • In topological order, merge the cuts of the fan-ins and discard those cuts that are not k-feasible

Page 13: Application Specific Instruction Generation for Configurable Processor Architectures

1. Pattern Enumeration (cont’d)

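[Figure: example DAG with multiplication nodes n1, n2, n3 over the inputs a, b, 0xf0, 0x12, c, addition nodes n4 and n5, and output addition node n6]

3-feasible cones:

n1: {a, b}

n2: {b, 0xf0}

n3: {c, 0x12}

n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0}

n5: {n2, n3}, {n2, c, 0x12}, {n3, b, 0xf0}; {b, 0xf0, c, 0x12} has 4 inputs, so it is not 3-feasible

n6: {n4, n5}, {n4, n2, n3}, {n5, n1, n2}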
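The cut-merging procedure can be sketched in Python on the example DAG. This is a minimal illustration, not the authors' implementation: the names `FANINS`, `TOPO_ORDER` and `enumerate_cuts` are hypothetical, and constants are counted as inputs, as in the cuts listed above.

```python
from itertools import product

# Fan-ins of each operation node in the example DAG (leaves: a, b, c and the constants).
FANINS = {
    "n1": ["a", "b"],
    "n2": ["b", "0xf0"],
    "n3": ["c", "0x12"],
    "n4": ["n1", "n2"],
    "n5": ["n2", "n3"],
    "n6": ["n4", "n5"],
}
TOPO_ORDER = ["n1", "n2", "n3", "n4", "n5", "n6"]


def enumerate_cuts(fanins, topo_order, k):
    """Return the k-feasible cuts of every node, excluding the trivial cut {node}."""
    all_cuts = {}

    def cuts_of(node):
        # A leaf (primary input or constant) only has the trivial cut {leaf}.
        return all_cuts.get(node, {frozenset([node])})

    for v in topo_order:
        merged = set()
        # Pick one cut per fan-in and take the union of the picks.
        for combo in product(*(cuts_of(u) for u in fanins[v])):
            cut = frozenset().union(*combo)
            if len(cut) <= k:  # discard cuts that are not k-feasible
                merged.add(cut)
        # Keep the trivial cut so that fan-outs can stop at v itself.
        all_cuts[v] = merged | {frozenset([v])}

    return {v: cuts - {frozenset([v])} for v, cuts in all_cuts.items()}


cuts = enumerate_cuts(FANINS, TOPO_ORDER, k=3)
for node in TOPO_ORDER:
    print(node, sorted(sorted(c) for c in cuts[node]))
```

With k = 3 this sketch reproduces the cuts listed for n1 to n5 and drops {b, 0xf0, c, 0x12}. For n6 it also reports {n1, n2, n3}, which has three inputs as well.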

Page 14: Application Specific Instruction Generation for Configurable Processor Architectures

2. Pattern Selection (1)

• Resource cost and the execution time can be obtained using a high-level estimation tool
• The extended instructions should satisfy the area constraint
• Use all the enumerated patterns
  • Optimal code can be generated
  • Mapping becomes unaffordable
• Heuristically select a set of patterns

Page 15: Application Specific Instruction Generation for Configurable Processor Architectures

2. Pattern Selection (2)

Basic idea: simultaneously consider speedup, occurrence frequency and area.

• Speedup
  • Tsw(p) = |V(p)|
  • Thw(p) = Length of the critical path of scheduled p
  • Speedup(p) = Tsw(p) / Thw(p)
• Occurrence
  • Some pattern instances may be isomorphic
  • Graph isomorphism test [Nauty Package]
  • The subgraphs are small, so the isomorphism test is very fast
• Gain(p) = Speedup(p) × Occurrence(p)

[Figure: example DAG with nodes n1 to n6, highlighting an instance of the pattern *+]

Pattern *+: Tsw = 3, Thw = 2, Speedup = 1.5
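The gain of the *+ pattern can be worked through in a short Python sketch. Two assumptions are made here. First, Tsw is taken as the sum of the software latencies of the pattern's nodes (2 for *, 1 for +), which gives the figure's Tsw = 3 and reduces to |V(p)| when every operation takes one cycle. Second, pattern instances are grouped by a shape label instead of a real isomorphism test, with the instance counts taken from the pattern table used in the binate covering slides. All names are illustrative.

```python
from collections import Counter

SW_LATENCY = {"*": 2, "+": 1}  # software latencies from the earlier ASIP example


def t_sw(pattern_ops):
    """Software time: sum of node latencies (equals |V(p)| for unit latencies)."""
    return sum(SW_LATENCY[op] for op in pattern_ops)


def speedup(pattern_ops, t_hw):
    return t_sw(pattern_ops) / t_hw


# Shape label of each nontrivial pattern instance in the example DAG
# (a crude stand-in for the graph isomorphism test).
instances = ["*+", "*+", "*+", "*+", "(*)+(*)", "(*)+(*)"]
occurrence = Counter(instances)

speedup_mul_add = speedup(["*", "+"], t_hw=2)
gain_mul_add = speedup_mul_add * occurrence["*+"]
print(speedup_mul_add, gain_mul_add)
```

Under these assumptions the *+ pattern has speedup 1.5 and, with four instances, a gain of 6.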

Page 16: Application Specific Instruction Generation for Configurable Processor Architectures

2. Pattern Selection (3)

Selection under Area Constraint

• Can be formulated as a 0-1 knapsack problem
  • 0-1 knapsack problem: Given n items (patterns) and a weight limit W (area constraint A), where the ith item (pattern) is associated with value (gain) vi and weight (area) wi, select a subset of items to maximize the total value, while the total weight does not exceed W.
• Optimally solvable by a dynamic programming algorithm.
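A minimal sketch of the dynamic programming selection, assuming integer area values. The function name `select_patterns`, the pattern names, and the gain and area numbers are made up for illustration and do not come from the slides.

```python
def select_patterns(patterns, area_limit):
    """0-1 knapsack by dynamic programming.

    patterns: list of (name, gain, area) tuples with integer area.
    Returns (best total gain, names of the chosen patterns).
    """
    # best[w] = (total gain, chosen names) using total area at most w
    best = [(0.0, [])] * (area_limit + 1)
    for name, gain, area in patterns:
        # Walk the capacity downward so each pattern is picked at most once.
        for w in range(area_limit, area - 1, -1):
            candidate = best[w - area][0] + gain
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - area][1] + [name])
    return best[area_limit]


# Hypothetical candidates: (name, gain, area)
candidates = [("mul_add", 6.0, 2), ("mul_add_mul", 5.0, 3), ("add_add", 2.0, 1)]
total_gain, chosen = select_patterns(candidates, area_limit=4)
print(total_gain, chosen)
```

With an area limit of 4, the sketch picks `mul_add` and `add_add` for a total gain of 8; taking both larger patterns would need area 5. The table has one entry per unit of area, so the running time grows with the area limit as well as with the number of patterns.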

Page 17: Application Specific Instruction Generation for Configurable Processor Architectures

3. Application Mapping (1)

Application mapping covers each node in G(V, E) with the extended instruction set to minimize the execution time.

The execution time of a mapped DAG is defined as the sum of the execution time of the patterns covering the DAG.

$T = \sum_{p:\ \text{non-trivial}} T_{hw}(p) + \sum_{p:\ \text{trivial}} T_{sw}(p)$
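As a worked instance of this definition, take the extended program from the earlier ASIP example. It is covered by two nontrivial patterns, extop1 and extop2, with a hardware time of 2 cycles each, and one trivial pattern, the final addition, with a software time of 1 cycle:

```latex
T = T_{hw}(\mathrm{extop1}) + T_{hw}(\mathrm{extop2}) + T_{sw}(+) = 2 + 2 + 1 = 5
```

This matches the 5 clock cycles reported for the program with extended instructions.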

Page 18: Application Specific Instruction Generation for Configurable Processor Architectures

3. Application Mapping (2)

Theorem: The application mapping problem is equivalent to the minimum-area technology mapping problem.

• Execution time ↔ area
  • Total area = sum of area of each component
  • Total execution time = sum of execution time of each pattern
• Minimum-area mapping is NP-hard → application mapping is NP-hard
• Many minimum-area technology mapping algorithms exist

Page 19: Application Specific Instruction Generation for Configurable Processor Architectures

Minimum-area technology mapping

• [Keutzer, DAC’87]
  • Tree decomposition + dynamic programming
• [Rudell] [Liao, ICCAD’95]
  • Min-cost binate covering
  • Given:
    • a boolean function f with variable set X
    • a cost function which maps X to a nonnegative integer
  • Objective: find an assignment for each variable so that the value of f is 1 and the sum of cost is minimized

Page 20: Application Specific Instruction Generation for Configurable Processor Architectures

Binate Covering (1)

[Figure: example DAG with multiplication nodes n1, n2, n3 over the inputs a, b, 0xf0, 0x12, c, addition nodes n4 and n5, and output addition node n6]

Pattern  Function  Cost  Covers
p0       +         1     n6
p1       +         1     n5
p2       +         1     n4
p3       *         2     n3
p4       *         2     n2
p5       *         2     n1
p6       *+        2     n1, n4
p7       *+        2     n2, n4
p8       *+        2     n2, n5
p9       *+        2     n3, n5
p10      (*)+(*)   2     n1, n2, n4
p11      (*)+(*)   2     n2, n3, n5

Page 21: Application Specific Instruction Generation for Configurable Processor Architectures

Binate Covering (2)

Covering clause: p0

The fan-ins of the sink node need to be covered by some pattern

Page 22: Application Specific Instruction Generation for Configurable Processor Architectures

Binate Covering (3)

The nodes that generate inputs to pi must be covered by some other pattern

Covering clause: p2+p6+p7+p10

Page 23: Application Specific Instruction Generation for Configurable Processor Architectures

Binate Covering (4)

p2 → p4 & p2 → p5

(¬p2 + p4) & (¬p2 + p5)

Page 24: Application Specific Instruction Generation for Configurable Processor Architectures

Binate Covering (4)

¬p6 + p4

¬p7 + p5

Page 25: Application Specific Instruction Generation for Configurable Processor Architectures

Binate Covering (5)

f = p0 (p2 + p6 + p7 + p10) (¬p2 + p4) (¬p2 + p5) (¬p6 + p4) (¬p7 + p5) (p1 + p8 + p9 + p11) (¬p1 + p3) (¬p1 + p4) (¬p8 + p3) (¬p9 + p4)

min-cost cover: p0, p10, p11 with cost 1+2+2 = 5
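The min-cost cover of f can be confirmed by exhaustive search, since there are only 2^12 assignments. The sketch below is a brute-force check, not the binate covering solver cited in the slides. The clause encoding as (index, polarity) pairs and the names `COST`, `CLAUSES` and `min_cost_cover` are assumptions made for illustration.

```python
from itertools import product

# Cost of p0 .. p11 from the pattern table: single additions cost 1, everything else 2.
COST = [1, 1, 1] + [2] * 9

# Clauses of f. Each literal is (pattern index, polarity); polarity False means negated.
CLAUSES = [
    [(0, True)],
    [(2, True), (6, True), (7, True), (10, True)],
    [(2, False), (4, True)],
    [(2, False), (5, True)],
    [(6, False), (4, True)],
    [(7, False), (5, True)],
    [(1, True), (8, True), (9, True), (11, True)],
    [(1, False), (3, True)],
    [(1, False), (4, True)],
    [(8, False), (3, True)],
    [(9, False), (4, True)],
]


def satisfies(assignment):
    """True if every clause has at least one literal matching the assignment."""
    return all(any(assignment[i] == pol for i, pol in clause) for clause in CLAUSES)


def min_cost_cover():
    best_cost, best_assignment = None, None
    for assignment in product([False, True], repeat=len(COST)):
        if not satisfies(assignment):
            continue
        cost = sum(c for c, selected in zip(COST, assignment) if selected)
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    chosen = [f"p{i}" for i, selected in enumerate(best_assignment) if selected]
    return best_cost, chosen


cost, cover = min_cost_cover()
print(cost, cover)
```

The search returns cost 5 with the cover p0, p10, p11, in agreement with the slide. Selecting only the single-operation patterns p0 to p5 also satisfies f, at cost 9, which is the execution time of the basic program.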

Page 26: Application Specific Instruction Generation for Configurable Processor Architectures

Experimental Results (1)

• A commercial reconfigurable system, Nios from Altera, is used to implement the ASIPs.
  • 5 extended instruction formats
  • up to 2048 instructions for each format
• Some DSP applications are taken as benchmarks
• Altera’s Quartus II 3.0 is used to aid the synthesis and the physical design of the extended instructions.

Page 27: Application Specific Instruction Generation for Configurable Processor Architectures

Experimental Results (2)

[Figure: Pattern size vs. number of pattern instances (2-input patterns). X axis: pattern size (1 to 10); Y axis: occurrence (0 to 140); series: fft_br, iir, fir, pr, dir, mcm]

Page 28: Application Specific Instruction Generation for Configurable Processor Architectures

Experimental Results (3)

[Figure: Speedup under different input size constraints. X axis: benchmarks fft_br, iir, fir, pr, dir, mcm; Y axis: speedup (0 to 8); series: 2-inputs, 3-inputs, 4-inputs]

Speedup = Tbasic / Textended

Ideal speedup

• pipeline hazard

• memory impact

Page 29: Application Specific Instruction Generation for Configurable Processor Architectures

Experimental Results (4)

Speedup and resource overhead on Nios implementations

           Extended        Speedup             Resource Overhead
Benchmark  Instruction #   Estimation  Nios    LE            Memory           DSP Block
fft_br     9               3.28        2.65    408 (6.06%)   65,536 (9.79%)   16
iir        7               3.18        3.73    255 (3.79%)   4,736 (0.71%)    40
fir        2               2.40        2.14    51 (0.76%)    1,024 (0.15%)    8
pr         2               1.57        1.75    71 (1.05%)    0 (0.00%)        14
dir        2               3.28        3.02    54 (0.80%)    0 (0.00%)        16
mcm        4               4.75        3.22    186 (2.76%)   0 (0.00%)        56
Average    -               3.08        2.75    - (2.54%)     - (1.77%)        -
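A quick arithmetic check of the Average row, assuming it is the plain mean over the six benchmarks. The per-benchmark figures below are copied from the table; the variable names are illustrative.

```python
# (estimated speedup, speedup measured on Nios, LE overhead in percent) per benchmark
rows = {
    "fft_br": (3.28, 2.65, 6.06),
    "iir": (3.18, 3.73, 3.79),
    "fir": (2.40, 2.14, 0.76),
    "pr": (1.57, 1.75, 1.05),
    "dir": (3.28, 3.02, 0.80),
    "mcm": (4.75, 3.22, 2.76),
}


def column_average(index):
    """Mean of one column over all benchmarks."""
    return sum(values[index] for values in rows.values()) / len(rows)


avg_estimation = column_average(0)
avg_nios = column_average(1)
avg_le_percent = column_average(2)
print(f"{avg_estimation:.2f} {avg_nios:.2f} {avg_le_percent:.2f}")
```

The three means agree with the 3.08, 2.75 and 2.54% entries of the Average row to two decimal places.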

Page 30: Application Specific Instruction Generation for Configurable Processor Architectures

Conclusions

• Propose a set of algorithms for ASIP compilation
  • Actual performance metric is used as the optimization objective
• Reduce the instruction mapping problem into an area-minimization logic covering problem
  • Operation duplication is considered implicitly
• Experiments show encouraging speedup

Page 31: Application Specific Instruction Generation for Configurable Processor Architectures

Thank You