Architecture and Synthesis for Power-Efficient FPGAscong/slides/edp04_lp_fpga_webversion.pdfArchitecture and Synthesis for Power-Efficient FPGAs Jason Cong University of California,

Architecture and Synthesis for Power-Efficient FPGAs

Jason CongUniversity of California, Los Angeles

[email protected]

Partially supported by NSF Grants CCR-0096383, and CCR-0306682, and Altera under the California MICRO program

UCLAUCLA

Joint work with Deming Chen, Lei He, Fei Li, Yan Lin

Outline

IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power OptimizationLow Power SynthesisConclusions

Why? FPGA is Known to be Power Inefficient!

FPGA consumes 50-100X more powerWhy do we care about power optimization for FPGAs ?!

Source:[Zuchowski, et al, ICCAD02]

FPGA Advantages

Short TAT (total turnaround time)No or very low NRE

ASICs Become Increasingly Expensive

Traditional ASIC designs are facing rapid increase of NRE and mask-set costs at 90nm and below

Source: EETimes

7.512

40

60

$0.0

$0.5

$1.0

$1.5

$2.0

$2.5

250nm 180nm 130nm 100nm

Tot

al C

ost f

or M

ask

Set (

$M)

0

$10

$20

$30

$40

$50

$60

Cos

t/Mas

k ($

K)

Process (um) 2.0 … 0.8 0.6 0.35 0.25 0.18 0.13 0.10

Single Mask cost ($K) 1.5 1.5 2.5 4.5 7.5 12 40 60

# of Masks 12 12 12 16 20 26 30 34

Mask Set cost ($K) 18 18 30 72 150 312 1,000 2,000

Our Research

Power EfficientFPGAs

Circuit Design

Fabric Design

System Design

Synthesis Tools

Outline


FPGA Architecture

Programmable IO

KLUTInputs D FF

Clock

Out

BLE# 1

BLE# N

NOutputs

I Inputs

Clock

I

N

Programmable Logic

Programmable Routing

BC-Netlist

BC-NetlistGenerator

Power Simulator

Power

BLIF

Logic Optimization(SIS)

Tech-Mapping (RASP)

Timing-Driven Packing (TV-Pack)

Placement & Routing (VPR)

SLIF

DelayArea

Arch Spec

BLIF

Logic Optimization(SIS)

Tech-Mapping (RASP)

Timing-Driven Packing (TV-Pack)

Placement & Routing (VPR)

SLIF

DelayArea

Evaluation Framework – fpgaEva-LP

fpgaEva-LP [Li, et al, FPGA’03]

BC-Netlist Generator

Mapped Netlist Layout

Buffer Extraction

Netlist Generation for Logic Clusters

Capacitance Extraction

Delay Calculation

BC-Netlist

Back-annotation

Mixed-level Power Model – Overview

Dynamic powerSwitching power Short-circuit power

Related to signal transitions

Functional switchGlitch

Dynamic

Interconnect & clock

Macro-modelMacro-modelStatic

Switch-level model

Macro-model

Logic Blockcomponents

power sources

Static PowerSub-threshold leakage Gate leakageReverse biased leakage

Depending on the input vector

Cycle-Accurate Power Simulator

Mixed-level Power Model

Post-layout extracted delay & capacitance

Random Vector Generation

BC-Netlist

Cycle Accurate Power Simulation with Glitch Analysis

All cycles finished?

No

Power Values

Yes∑ ∑∈ ∈

+=activei idlej

sacycle nEnEE

)()(

Logic Block Power19%

Interconnect Power59%

Clock Power22%

Power Breakdown

Interconnect power is dominant

Cluster Size = 12, LUT Size = 4

Clock Power15%

Interconnect Power45%

Logic Block Power40%

Cluster Size = 12, LUT Size = 6

Power Breakdown (cont’d)

Leakage Power42%

Dynamic Power58%

Dynamic Power48%

Leakage Power52%

Leakage power becomes increasingly important (100nm)

Cluster Size = 12, LUT Size = 4 Cluster Size = 12, LUT Size = 6

Outline

IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power Optimization

Architecture Parameter SelectionDual-Vdd/Dual-Vt FPGA Architecture

Low Power Synthesis with Dual-VddConclusion

Total Power along LUT and Cluster Size Changes

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

3 4 5 6 7

LUT Size

Tota

l FPG

A P

ower

(nor

mal

ized

ge

omet

ric m

ean)

Cluster Size = 4Cluster Size = 6Cluster Size = 8Cluster Size = 10Cluster Size = 12

Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches

Routing Architecture Evaluation

Architecture of Low-power and High-performance

0.78651.02680.88651.0502

Cluster size 12,LUT size 4,

Wire segment length 4,100% buffered routing

switches

High-performance

(Et3)

1.00800.89090.99040.9653

Cluster size 10, LUT size 4,

wire segment length 4,25% buffered routing switches

Low-power(E3t)

Et3E3tDelay (t)

Energy (E)

Best FPGA architectureApplications

Arch. Parameter selection leads to 10% power/delay trade-offUniform FPGA fabrics provide limited power-performance tradeoffNeed to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics

Outline

IntroductionUnderstanding Power Consumption in FPGAsArchitecture Evaluation and Power Optimization

Architecture Parameter SelectionDual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]

Low Power Synthesis with Dual-VddConclusion

Dual-Vdd LUT DesignDual-Vdd technique makes use of the timing slack to reduce power

VddH devices on critical path performanceVddL devices on non-critical paths powerAssume uniform Vdd for one LUT

Threshold voltage Vt should be adjusted carefullyfor different Vdd levels

To compensate delay increaseTo avoid excessive leakage power increase

Vdd/Vt-Scaling for LUTsThree scaling schemes

Constant-Vt scalingFixed-Vdd/Vt-ratio scalingConstant-leakage scaling

0.1

0.2

0.3

0.4

0.5

0.6

0.7

1.3v 1.0v 0.9v 0.8vVdd (V)

Del

ay (n

s)

constant Vtfixed-Vdd/Vt-ratioconstant leakage

0

1

2

3

4

5

6

7

8

9

10

1.3v 1.0v 0.9v 0.8v

Vdd (V)

Leak

age

Pow

er (

uW)

constant Vtfixed-Vdd/Vt-ratioconstant leakage

Constant-leakage scaling obtains a good tradeoffuseful for both single-Vddscaling and dual-Vdd design

Dual-Vt LUT DesignLUT is divided into two parts

Part I: configuration cells high VtPart II: MUX tree and input buffers normal Vt (decided by constant-leakage Vdd-scaling)

Configuration SRAM cellsContent remains unchanged after configurationRead/write delay is not related to FPGA performance

Use high Vt ~40% of VddMaintain signal integrityReduce SRAM leakage by 15Xand LUT leakage by 2.4XIncrease configuration time by 13%

Pre-Defined Dual-Vt FabricPower saving

11.6% for combinational circuits14.6% for sequential circuits

12.4%0.180spla9.4%0.0927seq

power savingpower (watt)

11.6%Avg.

14.7%0.256pdc9.4%0.0753misex311.6%0.059ex5p17.3%0.179ex101010.7%0.234des12.3%0.0536apex49.3%0.108apex28.5%0.0798alu4

arch-SVDT (Dual Vt)

arch-SVST (Single Vt)Circuit

Table1 Combinational circuits

14.0%0.0351tseng10.2%0.261s38484

power savingpower (watt)

14.6%Avg.

11.7%0.307s3841713.4%0.0736s29819.2%0.190frisc16.3%0.140elliptic14.5%0.134dsip19.7%0.0391diffeq14.8%0.632clma12.3%0.148bigkey

arch-SVDT (Dual Vt)

arch-SVST(Single Vt)circuit

Table2 Sequential circuits

Dual-Vdd FPGA FabricGranularity: logic block (i.e., cluster of LUTs)

Smaller granularity => intuitively more power savingBut a larger implementation overhead

Layout pattern: pre-defined dual-Vdd patternRow-based or interleaved patternRatio of VddL/VddH blocks is 2:1 (benchmark profiling)

Interconnect uses uniform VddH

L-block: VddL

H-block: VddH

Simple Design Flow for Dual-Vdd FabricBased on traditional design flow, but with new steps

Step I: LUT mapping (FlowMap) + P & Rassuming uniform VddH (using VPR)

Step II: Dual-Vdd assignment based on sensitivity

Setp III: Timing driven P & R considering pre-defined dual-Vdd pattern (modified VPR)

Comparison Between Vdd-Scaling and Dual-Vdd

For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)For low clock frequency, single-Vdd scaling is betterStill a large gap between ideal dual-Vdd and real case

Ideal dual-Vdd is the result without layout pattern constraint

circuit: alu4

0.03

0.04

0.05

0.06

0.07

0.08

0.09

65 75 85 95 105 115 125

Max. Clock Frequency (MHz)

Power (watt)

arch-SVDT (Vdd Scaling)

arch-DVDT(ideal case)

arch-DVDT(pre-defined Vdd)

1.3v

1.0v

0.9v

1.3v/0.8v

1.0v/0.8v0.9v/0.8v

1.5v

1.5/1.0v

1.3/1.0v

1.0/0.9v

1.5v/1.0v

1.3/0.9v

Vdd-Programmable Logic BlockPower switches for Vdd selection and power gatingOne-bit control is needed for Vdd selection, but two-bit control power gating

Experimental Results with Vdd-Programmable Blocks

Power v.s. performanceCircuit: alu4

0.03

0.04

0.05

0.06

0.07

0.08

0.09

65 75 85 95 105 115 125

clock frequency (MHz)

total power (watt)

arch-SV (Vdd scaling)arch-DV (configurable Vdd)arch-DV (ideal case)arch-DV (pre-defined Vdd)

1.3

v

1.0v

1.5v/1.0v

1.3v/0.8v

1.0v/0.8v

1.5v/1.0v

1.3v/0.9v

1.0v/0.8v

1.5v/0.8v

1.3v/0.8v

1.0v/0.9v

1.5v

0.9v/0.8v

1.0v/0.8v

1.3v/0.8v

1.5v/1.0v

Outline


Low Power Synthesis for Dual Vdd FPGAs

FPGA architecture with dual-Vdds adds new layout constraints for synthesis toolsNovel synthesis tools are required to support the architecture

Technology mapping [Chen, et al, FPGA’04]Circuit clustering [Chen, et al, ISLPED’04]

Technology Mapping for Low-Power FPGAs with Dual Vdds

ac

d

yxz

b

w

e

fg

Cut Enumeration:

Topological Order from PIs to POs.

Delay 1, Power 1

Delay 2, Power 2

Optimal Delay = 1

Power = 1.5

Optimal Delay = 2

Power = 2.5

Delay 2, Power 3.2

Delay 2, Power 3.5

Delay 2, Power 2.5

Optimal Delay = 1

Power = 1

Optimal Delay = 1

Power = 1

Represent 1 case: single high Vdd case

Dual-Vdd Cases

Consider:Converter delay & powerVddL LUT delay & powerVddH LUT delay & power

a c

d

yxz

b

w

eTargetLUT

Cases Input LUT Target LUT Converter1 VddL VddL No2 VddL VddH Yes3 VddH VddL No4 VddH VddH No

Input LUT

Four extra cases for dual-Vddconsideration

Produce these four cases for each cut and nodeMore tradeoff solution points Smaller power requires larger delaySmaller delay requires larger power

Low Vdd LUTHigh Vdd LUT

Mapping Solution Generation

From POs to PIsCritical path driven by VddHLUTNon-critical paths can be driven by VddL LUT, guided by low power

ac

d

yxz

b

w

e

fg

Two Types of Required Times

VddL VddH

33.2

R

x y

If R is using VddH:

converter

Req’d times

Mapped LUTs

1.7 = 2.0 - 0.3

Critical path

If R is using VddL:

Critical path

1.8 2.0 Req’d times propagated back

Req’d time of R is 1.7

Req’d time of R is 1.8

To be mapped

Each node maintains two req’d times:

Propagated separately

Interact with each other

Experimental Results

- 2.10%- 1.29%0.56%- 4.04%

Real powerEst'ed powerTotal edgesMapping area

SVmap (Single high Vdd) compared to Emap [Lamoureux, ICCAD03]

Mapping area considerably betterEstimated power very close to the real power reported after P&R

- 9.44%- 10.72%- 11.63%v1.3 - v1.0v1.3 - v0.9v1.3 - v0.8v1.3

DVmapSVmap

DVmap (dual Vdd) compared to SVmap

v1.3 as VddH and v0.8 as VddL is the best combination

Circuit Clustering with Dual VddsGiven:

A mapped FPGA designAn FPGA architecture with Dual-

Vdd configurable logic blocksGoal:

Cluster the LUTs into logic blocksAssign voltages to the logic blocks

such that the design hasOptimal delay Minimum power

Constraints:Logic Block Inputs ≤ KLogic Block Size ≤ MLogic Block Outputs ≤ MLUT delay = dL or dHInter-block edge delay = D

Input = 5Size = 3Output = 2

LUT

LUT LUT

LUTLUT

LUTLUT

Cluster Enumeration – An Example

m n o p q

r s

t To get a cluster of size 6 on LUT tGet 1 node on r, 4 on s, then

merge with t …., and

Get 2 nodes on r, 3 on s …

Common nodesPIs to POs

Dynamic Programming

Get 3 on r …

Solution Generation

m n o p q

r s

t

Cluster s1

Cluster s2

Solution propagation similar as [Vaishnav, ICCAD’99]

Delay, power and voltage (form solution points) propagate through the clusters and nodes iteratively

Try to get solutions for Cluster t1

Get solutions for sGet solutions for r

Solution Curve on Node rGood solutions: Any two delay-power-vdd points

(D1, P1, V1) and (D2, P2, V2)if D1 > D2, then P1 < P2if D1 < D2, then P1 > P2

02468

1012

0 1 2 3 4 5 6 7

H LH

Lpower

Delay

H105L9.85.4H8.246.1L86.2

VddPowerDelay

Good delay-power-vdd points The corresponding solution curve

Solution Propagation

02468

1012

0 1 2 3 4 5 6 7

H LH Lp

ower

Delay

02468

1012

0 1 2 3 4 5 6 7

H LH Lp

ower

Delay

Delay-power-vdd curve for r Delay-power-vdd curve for s

Consider:Converter VddL LUTVddH LUTEdge delay

All the good solutions are generated

All the inferior solutions are pruned away0

2

4

6

8

10

12

14

16

18

0 1 2 3 4 5 6 7 8 9

Delay

power

LL

LLH

H

Delay-power-vdd curve for cluster t1

Two Theorems

The algorithm gets the minimum number of solution points, W, optimally for each node

W is upper bounded by Lwhere L = level(v) for node v

The algorithm is delay and power optimal for trees and delay optimal for directed acyclic graphs (DAGs) with dual-Vdd FPGAs

22 )1( +L

Experimental Results Summary

- 18.4%- 19.5%- 20.3%v1.3 - v1.0v1.3 - v0.9v1.3 - v0.8v1.3

Dual VddSingle Vdd

Dual-Vdd Clustering results compared to the Single high-Vdd Clustering results

v1.3 as high Vdd and v0.8 as low Vdd is the best combination among the three

Outline


ConclusionsFPGA power consumption

Majority on programmable interconnectsLeakage is significant

FPGA architecture optimization for powerArchitecture parameter tuning has a limited impactUsing high Vt for configuration SRAM cells is helpfulUsing programmable dual Vdd for logic blocks is helpful

Power-efficient FPGA architectures introduce interesting CAD problems

Dual-Vdd mappingDual-Vdd clustering

Up to 20% power saving reported using these algorithms

Documents

Architecture and Synthesis for Power-Efficient FPGAscong/slides/edp04_lp_fpga_webversion.pdfArchitecture and Synthesis for Power-Efficient FPGAs Jason Cong University of California,