Rapid Prototyping Using Field Programmable Devices

1

Rapid Prototyping Using Field Programmable Devices

Allen C.-H. Wu

Department of Computer Science

Tsing Hua University

Hsinchu, Taiwan 30043, ROC

email: chunghaw@cs.nthu.edu.tw

2

Outline

Introduction to programmable logic devices and rapid prototyping.

FPGA design technologies and applications. Logic emulation. Reconfigurable computing and systems.

3

Part I

Introduction to Programmable Logic Devices and Rapid Prototyping

4

Programmable Logic Devices

SPLDs (simple PLDs). CPLDs (complex PLDs). FPGAs (field programmable gate arrays). SPGAs (system-programmable gate arrays).

5

Programmable Interconnect Components

FPID: I-Cube. - Dynamic switching. - Communication switches, network routes. - 32-320 programmable I/O ports. - Up to 150 MHz clock frequency.

FPIC: Aptix. - 1024 programmable I/O ports.

6

SPLD

Universal designs. Usable gates < 1,500. Speed is the main advantage. 0.5 um CMOS -> 3.5 ns logic delays -> 200 MHz. Market is shrinking 5-7% per year.

7

CPLD

Rising densities/performance and declining prices => become a good choice for many applications.

100K gates today, 250K gates by 1998. Low-density CPLD (32 macrocells/44 pins) -> 5 ns logic delays; high-density CPLD (128 macrocells/100 pins) -> 7.5 ns.

8

FPGA

- SRAM-programmed
  - Island: Xilinx LCA, AT&T Orca, Altera Flex
  - Cellular: Toshiba, Plessey’s ERA, Atmel’s CLi
- Antifuse-programmed: Actel ACT1 & 2, QuickLogic’s pASIC, Crosspoint’s CP20K
- EPROM-programmed: Altera’s MAX, AMD’s Mach, Xilinx’s EPLD

9

Categories of FPGA’s

Block organized, SRAM based. Channel organized, antifuse based. SOP organized (each logic cell resembles a PAL device), various programming techniques.

10

Block organized, SRAM based

[Figure: two-dimensional array of logic blocks (L) with switch blocks (S) in the routing channels between them.]

11

SRAM Programming Technology

SRAM cell controlling a pass transistor: “1” -> “on”, “0” -> “off”.

SRAM cell controlling a mux with inputs i1, i2 and output o: “1” -> o = i1, “0” -> o = i2.

12

SRAM Programming Technology

Advantages: - Reprogrammability. - Quality -> parts are fully tested at the factory. - Standard process technology.

Disadvantages: - Volatile -> the FPGA must be reprogrammed each time power is applied. - Needs an external memory to store the program. - Large area (6 transistors for 1 cell + 1 switch).

13

Cell Organized and Antifuse Based

L S

S

14

Antifuse Programming Technology

[Figure: antifuse cross-section showing poly, dielectric, diffusion, and substrate.]

- Normally in a high-Z state. - Can be fused to low impedance. - A high voltage melts the dielectric, which links the poly and the diffusion.

Small antifuse area!

15

EPROM/EEPROM Technology

EPROM can be reprogrammed, no need for external storage.

EPROM cannot be reprogrammed in circuit.

EEPROM can be reprogrammed in circuit. EEPROM consumes about twice the area of EPROM.

16

Erasable PLD (EPLD)

Logic array: SOP-based PAL.

Registers: configured to D, T, JK, or SR FFs; programmable clock to each FF.

I/Os: in, out, bidirectional.

17

Programming the FPGA

Configuration. Readback - design verification and debugging. Security - a security bit to prevent readback.

18

Advantages and Disadvantages of FPGA

Advantages: Fast turnaround. Low NRE (non-recurring engineering) charges. Low risk. Effective design verification. Low testing cost.

Disadvantages: Chip size & cost. Slow speed.

19

CPLD Vs. FPGA

                         CPLD                                 FPGA
Interconnect style       Continuous                           Segmented
Architecture and timing  Predictable                          Unpredictable
Software compile times   Short                                Long
In-system performance    Fast                                 Moderate
Power consumption        High                                 Moderate
Applications addressed   Combinational and registered logic   Registered logic only

Source: Altera

20

FPGA Selection Criteria

Density. Speed. Price. Flexibility.

21

SPGA

Allow multiple building blocks. Logic. Memory. Data path.

22

Applications Using SPGAs

Intellectual property (IP). Communication & networking. Graphical processing. Embedded processing.

23

Designing with SPGAs

A team-based approach. Understanding how to use SPGA system

features will be the key to pulling the entire design into a single device.

24

CMOS PLD Market Share

[Pie chart. Vendors: Other, Cypress, AT&T, Actel, Lattice, AMD, Altera, Xilinx. Shares: 5%, 3%, 5%, 6%, 11%, 15%, 24%, 31%.]

Source: Dataquest

25

CMOS Logic Market

[Pie chart. Segments: Std logic, Programmable, GA, Std cell, Custom, Chipset. Shares: 10%, 9%, 29%, 30%, 8%, 14%.]

Source: Dataquest

26

FPGAs Growth

[Bar chart: FPGA market by year, 1996 to 2000; vertical axis from 0 to 2,500 M USD.]

Source: Integrated Circuit Engineering

27

CMOS Programmable-logic Market

[Bar chart: CMOS programmable-logic market by year, 1997 to 2000; vertical axis from 0 to 5 B USD.]

Source: Dataquest

28

Rapid Prototyping

What? Why? How?

29

What is prototyping?

Basic components: FPGAs and FPICs. Hardware : boards, boxes, and cabinets. Software: methodologies and CAD tools.

30

Product Development Cycle

Market survey

Product development

Customer acceptance

Production

31

Pressures on Today’s Product Development

Time-to-market! Design complexity!

32

Why Is Prototyping Needed?

Design verification. Limited production. Concurrent engineering.

33

Design Verification

Specification

Final product

Functionality & requirements

Final functionality & performance

?

34

Design Process

Simulation

Formal verification

Logic emulation

Fast prototyping

Specification

System-level design

RTL design

Logic-level design

Physical-level design

Final chips

35

Verification Alternatives

                           Modeling   System        Prepare
                           accuracy   integration   time      Speed
Event-driven simulation    High       No            Short     Slow
Cycle-based simulation     Med.       No            Short     Med.
Behavioral simulation      Low        No            Short     Med.
Hardware-accelerated sim   Varies     No            Med.      Med. Fast
Breadboarding              Med.       Yes           Long      Very fast
Emulation or prototyping   Med.       Yes           Med.      Very fast

36

A Minute in the Life of a 100K Gates Design

Run time, in minutes, needed to cover one minute of real operation:

1 --------- Actual hardware at 50 MHz
10 -------- Logic emulator or prototype at 5 MHz
100 -------
2K -------- HW accelerator at 250M evals/sec
50K ------- Cycle-based simulator at 1K insts/sec (about 1 month)
120K ------ Compiled-code logic simulator at 125 MIPS (about 3 months)
800K ------ Event-driven logic simulator at 125 MIPS (about 1.5 years)
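The calendar figures on this slide follow from converting minutes of run time into days. A small sketch of that arithmetic, assuming the left-hand numbers are minutes of run time per minute of real operation, and using a 30-day month and a 365-day year (the function and dictionary names are illustrative only):

```python
MINUTES_PER_DAY = 60 * 24

# Slowdown factors read off the slide: minutes of run time
# per minute of operation of the real hardware.
SLOWDOWN_MINUTES = {
    "logic emulator or prototype": 10,
    "HW accelerator": 2_000,
    "cycle-based simulator": 50_000,
    "compiled-code logic simulator": 120_000,
    "event-driven logic simulator": 800_000,
}

def run_time_days(minutes):
    """Convert a run time in minutes to days."""
    return minutes / MINUTES_PER_DAY

for name, minutes in SLOWDOWN_MINUTES.items():
    days = run_time_days(minutes)
    print(f"{name}: {days:.1f} days "
          f"({days / 30:.1f} months, {days / 365:.2f} years)")
```

With these assumptions, 50K minutes is roughly five weeks, 120K minutes is just under three months, and 800K minutes is about a year and a half, which matches the 1 month, 3 months, and 1.5 years marked on the slide.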

37

Development with Prototyping

[Figure: schedule chart with three tracks. HW: design, build, integration, debug. CHIP: design, fab, debug. SW: design, code, integration, debug.]

38

Development with Prototyping

[Figure: schedule chart with three tracks. HW: design, build, HW integration & debug. CHIP: design, fab, chip debug, final integration. SW: design, code, system integration & SW debug.]

39

How to Develop a Prototype Using FPDs

Custom-designed prototyping board. Logic-emulation systems. Field-programmable printed-circuit-boards.

40

Part II

FPGA Design Technologies and Applications

41

FPGAs

What? - Programmable logic + programmable routing = FPGAs.

Why? - Zero NREs, easy bug fixes, and short time-to-market.

How?

42

Comparison of Different Design Technologies

               Custom   Std Cells   Gate Arrays   FPGAs
Design time    Long     Short       Short         Short
Fabrication    Long     Long        Short         None
Chip area      Small    Med.        Large         Very large
Design cost    High     Med.        Low           Very low
Unit cost      Low      Low         Med.          High
Design cycle   Long     Med.        Short         Very short

43

Emerging FPGA-based Applications

Low-volume production. Urgent time-to-market competition. Rapid prototyping. Logic emulation. Custom-computing hardware. Reconfigurable computing.

44

Design Considerations

Target architecture. Fixed logic and routing resources. Fixed I/O pins. Slow signal delays.

45

An HDL-based Design Flow

HDL design specification

RTL synthesis

Logic synthesis

Physical synthesis

FPGAs

Verification(Simulation)

46

Design Specification

HDLs - VHDL and Verilog. Why is an HDL-based design methodology needed? Target applications. Coding styles. Design representation. Design entry.

47

Why an HDL-based Design Methodology Is Needed

Then: Schematic capture -> Component mapping & maybe some logic optimization -> Place & route -> Layouts

Now: HDL design specification -> Synthesis -> Place & route -> Layouts

Design complexity

SW: assembly language => high-level language

48

Target Applications and Layout Architectures

Datapath dominated designs : DSPs and processors.

Control dominated designs: controllers and communication chips.

Mixed type of designs.

Bit-sliced stacks. Standard cells. Macro-cell-based. FPGAs.

49

HDL Coding Styles Vs. Design Quality

Ideas?

HDLspec1

HDLspec2

HDLspec3

Synthesis system

Design1 Design2 Design3

50

Coding Styles and Design Representation

Hierarchical style Structural style Random style FSMD

Behavioral level Logic level Gate level

module MUX2(o, i1, i2, sel);
  output [1:4] o;
  input  [1:4] i1, i2;
  input        sel;
  reg    [1:4] o;
  // sel = 1 selects i1, sel = 0 selects i2 (same polarity as the logic-level version)
  always @(sel or i1 or i2)
    case (sel)
      1'b1: o = i1;
      1'b0: o = i2;
    endcase
endmodule

module MUX2(o, i1, i2, sel);
  output [1:4] o;
  input  [1:4] i1, i2;
  input        sel;
  // One AND-OR select per bit
  assign o[1] = ((sel & i1[1]) | (~sel & i2[1]));
  assign o[2] = ((sel & i1[2]) | (~sel & i2[2]));
  assign o[3] = ((sel & i1[3]) | (~sel & i2[3]));
  assign o[4] = ((sel & i1[4]) | (~sel & i2[4]));
endmodule

51

RTL Synthesis

HDL compilation. Design representation. Component selection. Component generation. Resource sharing.

52

Logic Synthesis

Logic minimization. Technology dependent/independent

minimization. Technology mapping.

53

Physical Synthesis

Placement. Routing.

54

Logic Synthesis Problems for FPGAs

How to synthesize a logic network to realize a given function.

How to realize a logic network using FPGAs.

How to optimize a given network for area and timing.

How to synthesize routable circuits. How to solve these problems efficiently.

55

Representation of Boolean Functions

Truth tables. Factored forms: SOP and POS. BDD. Boolean networks.

56

Synthesis with Multiplexers

[Figure: Boolean equations -> HOW? -> a multiplexer with data inputs d0-d7, select inputs s1-s3, and output y.]

57

Synthesis with Look-Up-Table (LUT)

[Figure: Boolean equations -> HOW? -> a LUT with entries d0-d7 and output y.]

58

An Example

XOR(a,b) = a’b + ab’

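[Figure: two implementations of XOR. A 4-to-1 MUX with data inputs d0-d3, select inputs s0 and s1, and output y; and a decoder driven by a and b that addresses a RAM holding the truth table.]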
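The XOR example can be sketched in a few lines: a k-input LUT is just a 2^k-entry memory addressed by its inputs, and a multiplexer with constant data inputs behaves the same way. The helper names `lut2` and `mux4` below are hypothetical, chosen only for illustration:

```python
# Truth table of XOR(a, b) = a'b + ab', stored for addresses 00, 01, 10, 11.
XOR_CONFIG = [0, 1, 1, 0]

def lut2(config, a, b):
    """2-input LUT: read the stored bit addressed by (a, b), like a 4x1 RAM."""
    return config[(a << 1) | b]

def mux4(data, s1, s0):
    """4-to-1 multiplexer: the select lines pick one of the data inputs."""
    return data[(s1 << 1) | s0]

for a in (0, 1):
    for b in (0, 1):
        expected = ((1 - a) & b) | (a & (1 - b))  # a'b + ab'
        assert lut2(XOR_CONFIG, a, b) == expected
        assert mux4(XOR_CONFIG, a, b) == expected
```

Changing the four stored bits reprograms the same structure to any other two-input function, which is the point of LUT-based logic cells.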

59

Multilevel Logic Minimization

MIS and SIS by UC Berkeley. Optimization for timing, area, and power. Technology independent.

60

Technology Mapping for FPGAs

Technology mapping is the process of binding technology dependent circuits to technology independent circuits.

Technology mapping for FPGAs consists of two steps: (1) decomposition and (2) covering.

Technology mapper optimizes the final circuit by selecting sub-networks which are covered by LUTs.

61

Technology Mapping for FPGAs

LUTs have a fixed number of inputs, k, and can implement any logic function of up to k variables.

Nodes and sub-networks with at most k inputs in a Boolean network are referred to as feasible nodes and sub-networks; all others are infeasible.

Infeasible nodes need to be decomposed into a set of feasible nodes so that a circuit covering the network exists.

62

Technology Mapping for FPGAs

An FPGA-based technology mapper performs three tasks: 1. Decomposition - It decomposes infeasible expressions into feasible ones. 2. Reduction - It groups small expressions into CLBs to promote sharing of resources. 3. Packing - It allocates CLBs to expressions that cannot be shared.

63

Technology Mapping for FPGAs

The optimization goals for FPGA-based technology mapping include: 1. The number of CLBs, 2. The number of levels of CLB circuits, and 3. Routable designs.

64

Decomposition

Decomposition consists of three steps: 1. Identify divisors which are common to many functions. 2. Introduce the divisor as a new node. 3. Re-express existing nodes using the new nodes.

65

An Example

Given the expression f = ab’+ac’+ad’+a’b+bc’+bd’+a’c+b’c+cd’+b’d+c’d

Suppose a factor found is p = a+b+c+d

f can be re-expressed based on p: f = p(a’+b’+c’+d’)
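One way to convince yourself that the factored form is the same function is to compare both expressions over all 16 input combinations. The sketch below does that by brute force; the function names are illustrative. Multiplying out p(a’+b’+c’+d’) also yields the product a’d, which does not appear in the listed sum of products but is covered by the other terms, so the two forms still agree as Boolean functions.

```python
from itertools import product

def f_sop(a, b, c, d):
    """The eleven-term sum of products from the slide."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    terms = [a & nb, a & nc, a & nd, na & b, b & nc, b & nd,
             na & c, nb & c, c & nd, nb & d, nc & d]
    return int(any(terms))

def f_factored(a, b, c, d):
    """f = p (a' + b' + c' + d') with p = a + b + c + d."""
    p = a | b | c | d
    q = (1 - a) | (1 - b) | (1 - c) | (1 - d)
    return p & q

mismatches = [v for v in product((0, 1), repeat=4)
              if f_sop(*v) != f_factored(*v)]
print("mismatching input vectors:", mismatches)
```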

66

Decomposition Techniques

Disjoint decomposition. Shannon cofactoring. Roth-Karp decomposition. Algebraic decomposition. AND-OR decomposition.

67

Disjoint Decomposition

Disjoint decomposition can be found by searching through all possible partitions of inputs to the infeasible nodes, and using well known methods, such as residues, to determine if each partition leads to a disjoint decomposition.

Disadvantage: the number of partitions grows exponentially with number of inputs to the infeasible nodes.

68

Shannon Cofactoring

The residue of a function f(x1,x2,..,xn) with respect to a variable xj is the function obtained by fixing xj to a specific value. It is denoted by f(xj) for xj=1 and by f(xj’) for xj=0.

Ex. The residues, wrt a, of f(a,b,c,d)=ab+bc+bd’+a’cd are f(a’)=bc+bd’+cd and f(a)=b then f(a,b,c,d)=a’f(a’)+af(a)
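The example can be checked mechanically. The sketch below evaluates f, its two residues with respect to a, and the Shannon expansion a’f(a’) + af(a) over all inputs; the function names are illustrative.

```python
from itertools import product

def f(a, b, c, d):
    """f = ab + bc + bd' + a'cd."""
    return (a & b) | (b & c) | (b & (1 - d)) | ((1 - a) & c & d)

def f_a0(b, c, d):
    """Residue for a = 0: bc + bd' + cd."""
    return (b & c) | (b & (1 - d)) | (c & d)

def f_a1(b, c, d):
    """Residue for a = 1: b."""
    return b

for a, b, c, d in product((0, 1), repeat=4):
    # Each residue equals f with a fixed to the corresponding value.
    assert f(0, b, c, d) == f_a0(b, c, d)
    assert f(1, b, c, d) == f_a1(b, c, d)
    # Shannon expansion: f = a' f(a') + a f(a).
    expansion = ((1 - a) & f_a0(b, c, d)) | (a & f_a1(b, c, d))
    assert f(a, b, c, d) == expansion
```

Each residue depends on one variable fewer than f, which is why repeated cofactoring eventually produces nodes that fit a k-input LUT.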

69

Roth-Karp Decomposition

Try to decompose a function into the form f(x,y) = g(z1(x), z2(x),..,zt(x), y), where x is the bound set and y is the free set.

Based on the concept of compatible classes. The xl_k_decomp operation in SIS performs this decomposition for k-input LUTs. Computationally expensive. It is useful for small designs with a high degree of symmetry.

70

Algebraic Decomposition

Based on factored form representation and algebraic operations.

Manipulating algebraic expressions as polynomials; i.e., xi and xi’ are different variables.

To reduce search, only common cube factors and kernels are used.

Ex. x = ac+bc+bd+ce, y = a+b+e, and x = cy + bd

71

AND-OR Decomposition

Ensure that any infeasible node is decomposed into a set of feasible nodes.

Can be used to decompose large infeasible nodes into smaller infeasible nodes that are small enough to make an exhaustive search for disjoint decomposition practical.

Ex. F = ab+ac+bc can be decomposed into v=ab, w=ac, x=bc, y=v+w and z=y+x
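A minimal sketch of the OR half of this idea: a wide OR is split into a chain of nodes with at most k inputs each, so every node becomes feasible for a k-input LUT. The function name `decompose_or` and the generated node names (n0, n1, ...) are hypothetical, and a real mapper would also balance the tree for depth.

```python
def decompose_or(terms, k=2):
    """Split a wide OR over `terms` into nodes with at most k inputs each.

    Returns a list of (node_name, input_names); the last node is the root.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    nodes = []
    signals = list(terms)
    while len(signals) > 1:
        group, signals = signals[:k], signals[k:]
        name = f"n{len(nodes)}"
        nodes.append((name, group))
        signals.insert(0, name)  # the new node feeds the next node in the chain
    return nodes

# F = ab + ac + bc with v = ab, w = ac, x = bc, decomposed for 2-input nodes.
print(decompose_or(["v", "w", "x"], k=2))
```

For the slide's example this yields one node over v and w and a second node over that result and x, mirroring y = v+w and z = y+x.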

72

Covering

Graph-covering - for each node, find all the matches which cover that node. Then formulate as a covering problem.

Tree-covering - an approximation to graph covering. Since the average tree size is small, an optimal tree covering can be obtained using a dynamic programming method.

73

Covering Techniques

Decomposition-based covering using bin packing.

Covering reconvergent paths. Replication of logic at fanout nodes. Covering using edge visibility.

74

Tree-based Technology Mapping Methods

Chortle, Chortle-crf, and Chortle-d. Hydra. TM-based on edge visibility. mis-PGA.

75

Graph-Based Technology Mapping Methods

DAG-Map. Flow-Map. Area/depth trade-off.

76

Layout-Driven FPGA Synthesis

Mapping directed synthesis. Mapping with resynthesis. Combining technology mapping and

placement. Routability-driven technology mapping.

77

Performance-Driven Methods

mis-pga (xln_p) - mapping with synthesis. Logic synthesis during a timing driven placement.

M.map - intertwined mapping and placement procedures that take wiring delays into account.

78

Routability-Driven Methods

Alternative wires - attempt to identify alternative wires and alternative functions for wires that cannot be routed due to the limited routing resources.

Balanced routing resources and cell resources by trading off the routability with the compactness of a design. Try to deliver routable designs by controlling directly the pins-per-cell ratio of the design.

79

Sequential Synthesis for FPGAs

Each CLB has two FFs. Not much work has been done in this area. WHY?

Two attempts were made by the UCB group: mapping combinational and sequential circuits simultaneously, and mapping them separately.

How does Xilinx’s APR handle sequential circuits?

80

Placement

CLB netlist

Assign logic to cells

[Figure: the array of logic blocks (L) and switch blocks (S) onto which the CLB netlist is placed.]

81

Routing

[Figure: the array of logic blocks (L) and switch blocks (S) with routed connections.]

Interconnections are realized by turning on switches in the routing resources.

82

Placement & Routing Methods

Placement - simulated annealing is the commonly used method.

Routing - routability-driven and timing-driven.

Time-consuming design tasks. Architectural dependent.
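As a rough illustration of annealing-based placement, the sketch below places cells on a small grid and tries random swaps, always accepting improvements and occasionally accepting worse moves while the temperature is high. The cost function (half-perimeter wirelength), the cooling schedule, and all parameter values are assumptions made for this example, not the method of any particular tool.

```python
import math
import random

def wirelength(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[c][0] for c in net]
        ys = [placement[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(cells, nets, width, height, seed=0,
           t_start=5.0, t_end=0.05, alpha=0.9, moves=200):
    """Place cells on a width x height grid (at least two sites)."""
    rng = random.Random(seed)
    n_sites = width * height
    if len(cells) > n_sites:
        raise ValueError("more cells than sites")
    slots = list(cells) + [None] * (n_sites - len(cells))  # None = empty site
    rng.shuffle(slots)

    def to_placement(s):
        return {c: (i % width, i // width) for i, c in enumerate(s) if c is not None}

    cost = wirelength(to_placement(slots), nets)
    best, best_cost = list(slots), cost
    t = t_start
    while t > t_end:
        for _ in range(moves):
            i, j = rng.sample(range(n_sites), 2)
            slots[i], slots[j] = slots[j], slots[i]        # trial swap
            new_cost = wirelength(to_placement(slots), nets)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost                            # accept the move
                if cost < best_cost:
                    best, best_cost = list(slots), cost
            else:
                slots[i], slots[j] = slots[j], slots[i]    # reject: undo the swap
        t *= alpha                                         # cool down
    return to_placement(best), best_cost

cells = list("abcdef")
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
placement, cost = anneal(cells, nets, width=3, height=2)
print(placement, cost)
```

Recomputing the full wirelength after every swap keeps the sketch short; production placers update the cost incrementally, which is one reason placement is listed among the time-consuming tasks.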

83

HDL-based Design Flow for Multi-FPGA Designs

HDL description

HDL synthesis

Netlists

Partitioning

Partitioned netlists

84

Basic Partitioning Techniques

The min-cut partitioning: - The Kernighan-Lin algorithm. - The Fiduccia and Mattheyses algorithm. - The Krishnamurthy algorithm.

The ratio-cut algorithm. A variety of clustering algorithms.

85

Multi-FPGA Partitioning

Constraints: 1. Fixed number of I/O pins. 2. Fixed number of CLBs. 3. Utilization.

Objectives: 1. Cost minimization. 2. Delay minimization.
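The constraints above can be checked for any candidate partition by counting CLBs per FPGA and the nets that cross chip boundaries. The sketch below is one simple way to do that; it assumes one CLB per block and one I/O pin per cut net on each FPGA the net touches, and the function name `evaluate_partition` is hypothetical.

```python
def evaluate_partition(assignment, nets, clb_limit, pin_limit):
    """assignment maps block -> FPGA index; nets is a list of block lists."""
    clbs = {}
    pins = {}
    for block, fpga in assignment.items():
        clbs[fpga] = clbs.get(fpga, 0) + 1   # assumption: one CLB per block
        pins.setdefault(fpga, 0)

    cut_nets = 0
    for net in nets:
        touched = {assignment[b] for b in net}
        if len(touched) > 1:                 # the net crosses a chip boundary
            cut_nets += 1
            for fpga in touched:
                pins[fpga] += 1              # assumption: one pin per cut net

    feasible = all(clbs[f] <= clb_limit and pins[f] <= pin_limit for f in clbs)
    return {"cut_nets": cut_nets, "clbs": clbs, "pins": pins, "feasible": feasible}

assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
nets = [["a", "b"], ["b", "c"], ["a", "c", "d"], ["c", "d"]]
result = evaluate_partition(assignment, nets, clb_limit=2, pin_limit=2)
print(result)
```

A min-cut style partitioner would use a routine like this as its cost evaluation while moving blocks between FPGAs.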

86

Circuit-Level Partitioning Methods

Multiway partitioning methods based on the min-cut algorithm.

Interconnect minimization by cell replication.

Clustering-based partitioning methods - cone.

Combining top-down partitioning and bottom-up clustering methods.

87

Considerations for Multi-FPGA Partitioning

Limited IO-pin and logic resources. Logic utilization is dominated by the IO-pin limitation. Alleviating the IO-limitation problem is the key to improving the logic utilization of FPGA chips.

88

Combining HDL Synthesis and Partitioning

Bridging HDL synthesis and partitioning?

HDL description

HDL synthesis

Netlists

Partitioning

Partitioned netlists

89

Design Considerations

HDL Spec.

Application-Oriented Synthesis

Module-based

Bit-sliced Function-based

Fine-grained

Varying coding styles

Datapath-dominated

Control-dominated

90

Coding Styles

Top

M1 M2 M11 M12 M21 M22

Top

Mod1

Mod1_1

Mod1_2

Mod2

Mod2_1

Mod2_2

Top Top

M1 M2

M11 M12 M21 M22

91

The FSMD Coding Style

CU

CU1

CU2

DP

DP1

DP2

Top Top

CU DP

CU1 CU2 DP1 DP2

92

Integrated HDL-Synthesis and Partitioning Methodology

HDL descriptions

Module-based HDL synthesis

Fine-grained HDL synthesis

Bit-sliced-based HDL synthesis

Circuit-level partitioning

Covering-based partitioning

Bit-sliced-based partitioning

P&R

FPGAs

93

Module-based HDL Synthesis

Top

M1 M2 Mn

94

Fine-Grained HDL Synthesis

Top

M1 M2 Mn

P1 Pm

F1 F2

Clusters

95

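A Process Example

Process{P1}
  input[0:3] i1,i2;
  input i3;
  output[0:3] o1;
  output o2;
  o1 = i1 + i2;
  o2 = i1[0] & i3;

[Figure: process P1 drawn as a 4-bit adder (+) producing o1 and an AND gate (&) producing o2, then split into functions f1.0 ... f1.3 for o1[0] ... o1[3] and f2 for o2.]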

96

Functional-based Clustering

DesignModule{M1}

Process{P1}

Process{P2}

Module{M2}

Design

M1 M2

P1 P2

f1 f2

97

Bit-Sliced-Based Synthesis

Mux[0:7] Mux[0:5]MuxMux

Adder

[0] [5] [7]

Adder[0:7]

98

Functional Clustering

Mux

Mux

Adder

DP[0]

[7]

[5]

[7]

[0]

[0]

DP[0]

DP[7]

DPMux[0]

Mux[0]

Mux[7]

Adder[7]

Adder[0]

99

Part III

Logic Emulation

100

What is a Logic Emulation System

A programmable hardware built with programmable logic and programmable interconnect devices.

Software that automatically programs the hardware according to the circuit under design.

Control HW/SW to support operation of the emulated design as a hardware component operating in real time.

101

Typical Logic Emulation Environment

[Figure: a workstation (compiler, runtime software; stimulus generator, logic analyzer) connected to the logic emulator (logic module, probe module), which attaches to the target system through an in-circuit interface.]

102

Why Is Logic Emulation Needed?

Design verification issues. Real-time operation. System-level testing. Rapid prototyping.

103

Design Verification Issues

Simulation-based verification methods run out of steam as chip complexity grows.

Emulation is a verification technology that grows along with design size.

104

Real-Time Operation

Simulation requires test vector development which is costly and difficult. Verification depends on test vector correctness.

Certain applications must be verified in real time - human perception: audio and video.

Emulation connected to actual hardware can run: real diagnostic code, operating systems, and applications.

105

System-Level Testing

Often the chip meets spec but fails in the system.

System-level interactions between the chip and other components.

Internal probing is impossible when the chip is fabbed and placed in a system, but it is possible using emulation.

106

Rapid Prototyping

Once emulated design is debugged it is available for immediate use by software developers for software debugging.

Emulated design is available for demo and experiments with architecture on real applications and data.

107

Programmable Hardware

Programmable interconnect

Memory element

VLSI core

Interface

Logic element

Logic element

108

Considerations

The capacity of logic and interconnection depends on package constraints. This forces a hierarchical system. Chips => boards => boxes => system

The interconnect structure must: 1. Provide successful connectivity, 2. Maximize FPGA utilization, and 3. Minimize delay and skew.

Rent’s rule applies to predict interconnect needs.
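Rent's rule relates the number of external terminals T of a block to the number of gates g it contains, T = t * g^p, where t is the average number of terminals per gate and p is the Rent exponent. The sketch below uses placeholder values for t and p (they are not taken from the slides) to estimate pin demand and, inverted, how many gates a fixed pin budget can support.

```python
def rent_terminals(gates, t=2.5, p=0.6):
    """Estimated external terminals for a block of `gates` gates: T = t * g**p.

    t and p are illustrative defaults; real values depend on the design style.
    """
    return t * gates ** p

def max_gates_for_pins(pins, t=2.5, p=0.6):
    """Largest gate count whose Rent estimate fits in `pins` I/O pins."""
    return (pins / t) ** (1.0 / p)

for gates in (100, 1_000, 10_000):
    print(f"{gates} gates -> about {rent_terminals(gates):.0f} terminals")

print(f"200 pins -> about {max_gates_for_pins(200):.0f} gates")
```

Because p is below 1, pin demand grows more slowly than gate count, yet package pin counts grow more slowly still, which is why I/O pins rather than logic capacity often limit FPGA utilization in multi-chip systems.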

109

Multi-FPGA Systems

Topologies: - Mesh - nearest neighboring. - Crossbar - full and partial.

Interconnect scheme: - Circuit switched. - Time multiplexed.

110

Nearest Neighbor Interconnection

FPGA FPGA FPGA

FPGA FPGA FPGA

FPGA FPGA FPGA

111

Advantages and Disadvantages

Advantages: - Uniform: all chips the same. - Easy to lay out on PCB.

Disadvantages: - Routing is easily blocked. - Through pins limit logic utilization of FPGAs. - Long and unpredictable delays. - No natural hierarchical extension.

112

Nearest Neighbor Extensions

FPGA FPGA FPGA

FPGA FPGA FPGA

FPGA FPGA FPGA

113

Advantages and Disadvantages

Advantages: - More choices for router by adding diagonal lines & skip lines.

Disadvantages: - More complex PCB. - More complex routing software.

114

Partial Crossbar Interconnect

A B C D A B C D A B C D A B C D

A pins B pins C pins D pins

Logic blocks

Crossbars

Second-level crossbars

115

Partial Crossbar Interconnect

Partial crossbar consists of a set of small full crossbars, connected to logic blocks but not to each other.

I/O pins of each FPGA are divided into subsets. Each subset is connected by a full crossbar circuit switch.

Partial crossbar is a potentially blocking network.

116

Partial Crossbar Characteristics

Partial crossbar’s size is proportional to the number of FPGA pins.

All interconnections go through one/three crossbar chips for a one-level/two-level partial crossbar interconnect - delays are uniform and bounded.

117

Mixed Full and Partial Crossbar

FPGA

LocalFPIC

Global FPIC

Global FPIC

LocalFPIC

LocalFPIC

FPGA FPGAFPGAFPGA FPGA

Externalconnections

Partialcrossbar

Full crossbar

118

Circuit Switched Vs. Time Multiplexed

Trade off operating speed and hardware cost.

Time-multiplexing method: - Can greatly expand available interconnect. - Allows lower-cost IC packages and PCBs. - Makes partitioning easier.

BUT - System power increases due to frequent signal switching (higher hardware cost). - Complex scheduling software. - Slow operating speed.

119

Virtual Wires

FPGA FPGAPhysical wires

Logicaloutputs

Logical inputs

FPGA FPGA

Mux

Mux

120

Logic Emulation Systems

System with mesh topology - Quickturn’s RPM and Virtual Machine Works (IKOS).

System with partial crossbar - Quickturn’s Enterprise, Mars, and System Realizer.

System with mixed full and partial crossbar - Aptix Prototyping System.

System using time-multiplexed interconnect - Virtual Machine Works (IKOS) , CoBALT and Arkos (Quickturn).

121

Memory Solutions

Goal: programmable memories with different width/depth/port combinations.

FPGA-based memories: - Inefficient use of logic resources. - Timing correctness is difficult to ensure. - Large or highly multi-ported memories must be partitioned across several FPGAs.

SRAMs with dedicated or programmable controllers.

122

Logic Emulation Design Flow

Pre-configuration preparation

Full-chip configuration

In-circuit emulation

HDL synthesis

Synthesis

Partitioning

System mapping

P & R

Design downloading

Emulators

123

Logic Emulation Design Compiler

Logic emulation design compiler is a large and complex EDA tool which includes: - Front-end design importer. - HDL-based synthesizer. - Clock and timing analyzer. - Partitioner. - System-level placer and router. - FPGA-based placer and router.

124

Objectives

Fast compilation time. Fast emulation clock. Timing correctness. Easy ECO. Minimize circuit size.

125

Design Considerations

HDL synthesis: - Trade-off run-time and quality. - CLB-based Vs. gate-based designs.

Clock and timing analysis: - Timing correctness, hold-time violation free. - Clock skew minimization.

Partitioning: - Run time. - Timing and area.

126

Design Considerations

System placement and routing: - Timing. - Completeness of routing.

FPGA-based placement and routing: - Fast run time. - Parallel compilation.

127

Hold-Time Violation

Hold-time violation occurs when Routing delay > LUT delay!!!

D Q

CK

D Q

CKLUT

CLB

Routing delay

Clock distribution problem (Skew)!!!
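The condition on this slide is an instance of the usual hold-time check between two flip-flops: the new data must not reach the receiving flip-flop before its hold window closes, so clock skew larger than the data-path delay causes a violation. A minimal sketch, with hypothetical parameter names and made-up delay values in nanoseconds:

```python
def hold_slack(t_clk_to_q, t_data_path, t_hold, t_clock_skew):
    """Hold slack at the receiving flip-flop (all values in ns).

    t_clock_skew is how much later the clock reaches the receiving
    flip-flop than the sending one. Negative slack means a violation.
    """
    return (t_clk_to_q + t_data_path) - (t_hold + t_clock_skew)

def required_padding(slack):
    """Extra data-path delay needed to remove a hold violation."""
    return max(0.0, -slack)

# Made-up numbers: a short LUT path and a clock net with a long routing delay.
slack = hold_slack(t_clk_to_q=1.0, t_data_path=2.0, t_hold=0.5, t_clock_skew=4.0)
print("hold slack:", slack, "ns; padding needed:", required_padding(slack), "ns")
```

The padding value corresponds to the delay-insertion fix: lengthening the data path until the slack is no longer negative.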

128

Timing Correctness

D Q

CK

D Q

CKLUT

CLB

Routing delay

Delayelement

Delay insertion

129

Timing Correctness

D Q

CK

D Q

CKLUT

CLB

Clock path

CE

Primary clock Low-skew net

Use clock enables for gated clocks

130

Methodology

Pre-configuration preparation - prepare netlists and control files for configuration.

Testbed preparation - prepare emulation-based operation environment.

Full-chip configuration - download design to the emulator.

In-circuit emulation - test the design.

131

Pre-Configuration

Translate the leaf-cell libraries into emulation primitives.

Translated libraries must be verified for functional equivalence to original.

Modify and redesign some components to attain compatibility with emulation techniques, such as precharge logic circuits.

Assemble all the gate-level netlists for the entire design.

132

Testbed

Design and implement target ICE board combining the emulated design with real hardware.

Slow down the testbed to emulation speed. Assemble the testbed and emulation equipment.

133

Full-Chip Configuration & In-Circuit Emulation

Full-chip configuration: - Prepare control files. - Partition the design to fit into the emulation system. - Download design into the system. - Verify that emulation model faithfully implements the design as specified by RTL.

In-circuit emulation

134

Part IV

Reconfigurable Computing and Systems

135

General-Purpose Computing Vs. Custom Computing

General-purpose computing - applying applications on a general-purpose computer.

Custom computing - applying applications on a custom-made application-specific hardware.

Field-programmable devices make this into a reality.

136

Goals of Reconfigurable Computing

Tailor the architecture to the application. Minimize or eliminate instruction

interpretation. Exploit fine grained parallelism. Map software to hardware.

137

Applications

Database search and analysis. Image processing and machine vision. Data compression. Signal processing. Neural networks. Biology computing. Medical computing. Many more.

138

ROM

Application 1

Multi-Mode Systems

Reconfigurable system

- Different configurations for read & write operations of a tape drive (Honeywell).

- Different configurations for different printer controllers (Tektronix).

Application 2

139

Run-Time Reconfiguration

[Figure: image data arrives through I/O and is matched against candidate targets: Jeep? Tank? Truck?]

- Break a single computation into multiple pieces.

- Page in components as needed (virtual hardware), e.g., automatic target recognition.

140

Custom Computing

Application-specific systems. Numerous applications for similar reconfigurable systems. Offers hardware performance and the flexibility to handle numerous algorithms. Multi-FPGA systems can be viewed as hardware supercomputers.

141

Reconfigurable Coprocessors

Processor

Coprocessor

Program 1

Inst1

Program 2

Inst2

- Provide custom instructions on a per-application basis.

142

Types of Reprogrammable Systems

Coprocessor

CPU

Attachedprocessing unit

Memory caches

I/Ointerface

Standalone PU

143

Types of Reprogrammable Systems

Attached and standalone processing units are reprogrammable systems on computer add-on cards and separate reprogrammable cabinets. Considerations: large communication overhead may over-shadow the speed gain.

Application-specific coprocessors can achieve significant improvement over a wide range of applications.

144

Types of Reprogrammable Systems

Integrate the reprogrammable logic into the processor itself. - A reprogrammable functional unit can be configured on a per-algorithm basis. - Providing some special-purpose instructions tailored to the needs of a given application.

145

Architectures of Multi-FPGA Systems

The most commonly used topologies: - Mesh: 1D (linear array), 2D, and 3D. - Crossbar: full, partial, mixed, and hierarchical. - Hybrid between mesh and crossbar. - Application-specific architecture.

146

Hybrid Topology

Splash 2: augments a linear array of FPGAs with a crossbar switch.

Goal: supporting systolic circuits.

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

FPGA

16 FPGAs

Ext. InterfaceExt. Interface

147

Hybrid Topology

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

Hostinterface

Anyboard: A linear array of FPGAs augmented by global buses.

148

Hybrid Topology

[Figure: a 4 X 4 mesh of FPGAs with four RAMs and a host interface.]

DECPeRLe-1: a 4 X 4 mesh of FPGAs augmented with shared global buses.

149

Application-Specific Topology

FPGA

FPGA

FPGA

FPGA

FPGA

FPGA

FPGA

FPGA

FPGA

FPU

Memory

11

1

1

4 5 2 3

4 5 2 3

4 5 2 3

The Marc-1: subsystem 1.

150

Application-Specific Topology

1

5

4

3

2Subsystem1

Subsystem1

The Marc-1

Targeted at circuit simulation, where the program to be executed can be optimized on a per-run basis for values that are constant within that run but may vary from dataset to dataset.

151

Application-Specific Topology

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

FPGA

RAM

The RM-nc system: neural network.

152

Architecture for Computer Prototyping

FPGA

FPGA

FPGA

FPGAFPGA

FPGAFPGACache memory

Register file

ALU FPU

VME bus

The Mushroom processor prototyping system.

153

Expandable Topology

Hierarchical crossbar topology: by adding extra level. - Quickturn systems.

Expandable mesh topology: by connecting individual board to form a large mesh. - The Virtual Wires Emulation System (IKOS).

154

Topology for Adapting Other Components

Many multi-FPGA systems include non-FPGA resources to provide more general purpose solutions.

The MORRPH system - sockets next to the FPGAs allow arbitrary devices to be added to the array.

The G800 board - contains two FPGAs and four sockets.

155

Topology for Adapting Other Components

The COBRA system - contains base modules (expanding to a 2D mesh), RAM modules, I/O modules, and bus modules.

The Springbok system - pre-made daughter boards, each able to hold an arbitrary device (on the top) and an FPGA (on the bottom). The daughter boards are mounted on a baseplate.

156

Topology for Adapting Other Components

The Quickturn systems - external component adapters.

The Aptix FPCB - a reprogrammable PCB.

157

Design Methodology

Applications

Hostcomputer

Reprogrammable system

Mapping

158

Typical Software Methodology

Application spec.

Analysis System-level synthesis

Software spec.

Codegeneration

Object codeHardware synthesis

Hardware spec.

159

Typical Software Methodology

Hardware spec.

Synthesis

Partitioning & placement

Pin assignment & routing

FPGA P & R

Bit-stream files

160

Considerations

Architecture-specific design tasks. Design automation process. The mapping time dominates the setup time for operating the system. Run-time reconfigurability.

161

Design Specification and Languages

Standard software programming languages, e.g., C, C++, FORTRAN, and assembly language, Vs. HDLs.

Standard software programming languages - a sequential execution model.

HDLs - a parallel execution model.

Who will use it, and which one is more suitable for system description?

162

Compilation Issues

Translate code from software languages into hardware without losing the inherent concurrency of hardware.

Compiler techniques for parallelizing code. Straight-line code, control flow, and loops. Transmogrifier C compiler.

163

System-level and High-level Synthesis

System-level design evaluation and analysis. Design estimation. Hardware-software partitioning. Interface synthesis. RTL synthesis. Logic synthesis and technology mapping.

164

Partitioning and Placement

Topology-aware partitioning methods. Partitioning onto a multi-FPGA system is equivalent to a placement problem. Logic utilization and timing.

165

Pin Assignment and Routing

Pin-assignment - the process of determining which I/O pins to be used for each inter-FPGA signal.

Pin-assignment for a pre-fabricated multi-FPGA system is equivalent to the global routing problem.

Pin-assignment will greatly affect the quality of FPGA’s logic utilization and routability.

166

Run-Time Reconfigurability

Virtual hardware <=> virtual memory. Hardware on demand. Unconfigured and reconfiguring methods. Software supporting time-varying mapping. Many open problems need to be solved in the forthcoming years.

167

Applications: Splash 2

Stream-oriented systolic and SIMD applications. Scalable linear array of 16 to 256 processing elements (1 XC4010 with 1/2 Mbyte). VHDL based. Sequence comparison - 2300M:0.75M cell updates/sec (Splash 2:Sparc 10). Edge detection - 10M:242K pixels/sec (Splash 2:Sparc 10).
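The throughput pairs quoted for Splash 2 and the Sparc 10 translate directly into speedup factors. A short sketch of that division, using only the numbers on the slide (the dictionary and function names are illustrative):

```python
# (Splash 2 throughput, Sparc 10 throughput) as quoted on the slide.
THROUGHPUT = {
    "sequence comparison (cell updates/sec)": (2300e6, 0.75e6),
    "edge detection (pixels/sec)": (10e6, 242e3),
}

def speedup(fast, slow):
    """Ratio of the two throughputs."""
    return fast / slow

for task, (splash2, sparc10) in THROUGHPUT.items():
    print(f"{task}: about {speedup(splash2, sparc10):.0f}x")
```

This works out to roughly three thousand times for sequence comparison and about forty times for edge detection.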

168

Applications: PAM (DEC)

Programmable Active Memory (PAM). C++ based and mesh arrays of XC3090 (DECPeRLe-1).

Applications: - Multiple precision arithmetic. - RSA encryption. - Video compression (JPEG, MPEG, DCT). - High energy physics. - Telecommunications.
