CS 250 VLSI System Design
Lecture 10 – Design Verification
2010-10-11
John Wawrzynek and Krste Asanovic, with John Lazzaro
TA: Yunsup Lee
www-inst.eecs.berkeley.edu/~cs250/
UC Regents Fall 2010 © UCB



[Slide background: page 28 of J. D. Warnock et al., IBM J. Res. & Dev., vol. 46, no. 1, January 2002]

[...] multi-site team, necessitating the development of ways to synchronize the design environment and data (as well as the design team).

In the following sections of this paper, the design methodology, clock network, circuits, power distribution, integration, and timing approaches used to meet these challenges for the POWER4 chip are described, and results achieved for POWER4 are presented.

Design methodology

The design methodology for the POWER4 microprocessor featured a hierarchical approach across multiple aspects of the design. The chip was organized physically and logically in a four-level hierarchy, as illustrated in Figure 2. The smallest members of the hierarchy are “macros” typically containing 50 000 transistors. Units comprise approximately 50 related macros, with the microprocessor core made up of six units. The highest level is the chip, which contains two cores plus the units associated with the on-chip memory subsystem and interconnection fabric. This hierarchy facilitates concurrent design across all four levels. While the macros (blocks such as adders, SRAMs, and control logic) are being designed at the transistor and [...]

Figure 1: POWER4 chip photograph showing the principal functional units in the microprocessor core and in the memory subsystem.

Figure 2: Elements in the physical and logical hierarchy used to design the POWER4 chip. [Diagram labels: chip; cores; memory subsystem; units (FPU, FXU, IFU, ...); macros 1 to n. Macros, units, core, and chip all generate initial timing and floorplan contracts.]

Table 1: Features of the IBM CMOS 8S3 SOI technology.

    Gate Leff         0.09 µm
    Gate oxide        2.3 nm
    Metal layers      pitch      thickness
      M1              0.5 µm     0.31 µm
      M2              0.63 µm    0.31 µm
      M3–M5           0.63 µm    0.42 µm
      M6 (MQ)         1.26 µm    0.92 µm
      M7 (LM)         1.26 µm    0.92 µm
    Dielectric εr     ~4.2
    Vdd               1.6 V

Table 2: Characteristics of the POWER4 chip fabricated in CMOS 8S3 SOI.

    Clock frequency (fc)      >1.3 GHz
    Power                     115 W (@ 1.1 GHz, 1.5 V)
    Transistors               174,000,000
    Macros (unique/total)     1015 / 4341
      Custom                  442 / 2002
      RLM                     523 / 2158
      SRAM                    50 / 181
    Total C4s                 6380
    Signal I/Os               2200
    I/O bandwidth             >500 Mb/s
    Bus frequency             1/2 fc
    Engineered wires          35K
    Buffers and inverters     100K
    Decoupling cap            300 nF

IBM POWER4: 174 Million Transistors

A complex design ...

96% of all bugs were caught before first tape-out.

First silicon booted AIX & Linux, on a 16-die system.

How ???


Three main components ...

(1) Specify chip behavior at the RTL level, and comprehensively simulate it.

(2) Use formal verification to show equivalence between Verilog RTL and circuit schematic RTL.

(3) Technology layer: do the electrons implement the RTL, at speed and power?

Today, we focus on (1).


Lecture Focus: Functional Design Test

[Slide background: p. 1600, IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001]

Fig. 1. Process SEM cross section.

The process [...] was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the [...] versus [...] dependence and source-to-body bias is used to electrically limit transistor [...] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re[...]

Testing goal: The processor design correctly executes programs written in the Instruction Set Architecture.

Not manufacturing tests ...

“Correct” == meets the “Architect’s Contract”

Intel XScale ARM Pipeline, IEEE Journal of Solid State Circuits, 36:11, November 2001


Architect’s “Contract with the Programmer”

To the program, it appears that instructions execute in the correct order defined by the ISA.

What the machine actually does is up to the hardware designers, as long as the contract is kept.

As each instruction completes, the architected machine state appears to the program to obey the ISA.
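One way to make the contract concrete is a small executable ISA model that updates only the architected state, one instruction at a time, in program order. The sketch below is illustrative only: the three-instruction toy ISA (`li`, `add`, `bnez`), the `ArchState` class, and the `step`/`run` helpers are assumptions made for this example, not part of the course material.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


@dataclass
class ArchState:
    """Architected state visible to the program: PC and registers."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)


def step(state, program):
    """Execute exactly one instruction of the hypothetical toy ISA."""
    op, *args = program[state.pc]
    if op == "li":        # li rd, imm : rd <- imm
        rd, imm = args
        state.regs[rd] = imm & MASK32
        state.pc += 1
    elif op == "add":     # add rd, rs, rt : rd <- rs + rt (wraps at 32 bits)
        rd, rs, rt = args
        state.regs[rd] = (state.regs[rs] + state.regs[rt]) & MASK32
        state.pc += 1
    elif op == "bnez":    # bnez rs, target : branch if rs != 0
        rs, target = args
        state.pc = target if state.regs[rs] != 0 else state.pc + 1
    else:
        raise ValueError(f"unknown opcode {op!r}")
    return state


def run(program, max_steps=1000):
    """Run until the PC leaves the program; record state after each instruction."""
    state, trace = ArchState(), []
    for _ in range(max_steps):
        if state.pc >= len(program):
            break
        step(state, program)
        trace.append((state.pc, tuple(state.regs)))
    return state, trace
```

The per-instruction trace is what a hardware implementation must appear to reproduce, however it reorders or overlaps the work internally.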


Three models (at least) to cross-check.


The “contract” specification: “The answer” (correct, we hope). Simulates the ISA model in C. Fast. Better: two models coded independently.

The Verilog RTL model: Logical semantics of the Verilog model we will use to create gates. Runs on a software simulator or FPGA hardware.

Chip-level schematic RTL: Catch synthesis bugs. Formally verify netlist against Verilog RTL. Also used for timing and power.
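Cross-checking two independently coded models can be sketched as a lockstep comparison that stops at the first instruction where they disagree. Everything here is hypothetical: the 8-bit accumulator machine, its three opcodes, and the deliberately injected `neg` bug exist only to show the idea. The slide mentions C for the ISA model; Python is used here for brevity.

```python
MASK = 0xFF  # assumption: hypothetical 8-bit accumulator machine


def model_a(program):
    """First model: plain if/elif interpreter; yields acc after each instruction."""
    acc = 0
    for op, arg in program:
        if op == "addi":
            acc = (acc + arg) & MASK
        elif op == "shl":
            acc = (acc << arg) & MASK
        elif op == "neg":
            acc = (-acc) & MASK
        else:
            raise ValueError(op)
        yield acc


def make_model_b(buggy=False):
    """Second model, coded independently with a dispatch table."""
    table = {
        "addi": lambda acc, n: (acc + n) % 256,
        "shl": lambda acc, n: (acc * 2 ** n) % 256,
        # Correct negate is two's complement; the buggy variant only inverts bits.
        "neg": (lambda acc, n: acc ^ 0xFF) if buggy
               else (lambda acc, n: (256 - acc) % 256),
    }

    def model_b(program):
        acc = 0
        for op, arg in program:
            acc = table[op](acc, arg)
            yield acc

    return model_b


def first_mismatch(program, ref, dut):
    """Run both models in lockstep; return (index, instr, ref, dut) or None."""
    for i, (r, d) in enumerate(zip(ref(program), dut(program))):
        if r != d:
            return i, program[i], r, d
    return None
```

Because the two models share no code, a disagreement points at a bug in one of them (or at an ambiguity in the specification) rather than at a shared mistake.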

Where do bugs come from?


Where bugs come from (a partial list) ...


The contract is wrong. You understand the contract, create a design that correctly implements it, write correct Verilog for the design ...

The contract is misread. Your design is a correct implementation of what you think the contract means ... but you misunderstand the contract.

Conceptual error in design. You understand the contract, but devise an incorrect implementation of it ...

Verilog coding errors. You express your correct design idea in Verilog ... with incorrect Verilog semantics.

Verilog: name misspellings, latch implication, combinational loops.


Four Types of Testing


Big Bang: Complete Processor Testing

[Diagram labels: top-down testing; bottom-up testing; complete processor testing]

How it works: Assemble the complete processor. Execute test program suite on the processor. Check results.

Checks contract model against Verilog RTL. Test suite runs the gamut from “1-line programs” to “boot the OS”.
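The assemble, execute, check loop can be sketched as a suite runner that feeds every test program to both the contract model and the design under test, then compares final architected state. This is a minimal sketch under stated assumptions: the two-instruction toy ISA, the `SUITE` contents, and `dut_stub` (a Python stand-in for what would really be a Verilog RTL simulation, with an injected carry bug) are all hypothetical.

```python
def contract_model(program, nregs=4):
    """Golden model for a toy ISA: ('li', rd, imm) and ('add', rd, rs, rt)."""
    regs = [0] * nregs
    for instr in program:
        if instr[0] == "li":
            _, rd, imm = instr
            regs[rd] = imm & 0xFFFF
        elif instr[0] == "add":
            _, rd, rs, rt = instr
            regs[rd] = (regs[rs] + regs[rt]) & 0xFFFF
        else:
            raise ValueError(instr)
    return regs


def dut_stub(program, nregs=4):
    """Stand-in for the RTL simulation, with an injected bug in 'add'."""
    regs = [0] * nregs
    for instr in program:
        if instr[0] == "li":
            _, rd, imm = instr
            regs[rd] = imm & 0xFFFF
        elif instr[0] == "add":
            _, rd, rs, rt = instr
            a, b = regs[rs], regs[rt]
            low = (a & 0xFF) + (b & 0xFF)
            high = (a >> 8) + (b >> 8)  # BUG: carry out of the low byte is dropped
            regs[rd] = ((high << 8) | (low & 0xFF)) & 0xFFFF
        else:
            raise ValueError(instr)
    return regs


# Hypothetical suite, ordered from trivial to (slightly) more demanding.
SUITE = {
    "one_liner": [("li", 0, 7)],
    "small_add": [("li", 0, 1), ("li", 1, 2), ("add", 2, 0, 1)],
    "carry_add": [("li", 0, 0x00FF), ("li", 1, 0x0001), ("add", 2, 0, 1)],
}


def run_suite(suite, golden, dut):
    """Return {test name: passed?}, comparing final register state only."""
    return {name: golden(prog) == dut(prog) for name, prog in suite.items()}
```

With these definitions, only `carry_add` fails. An end-state comparison says which program failed but not which instruction or block caused it, which is part of the motivation for testing smaller pieces in isolation.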


Methodical Approach: Unit Testing

[Diagram labels: top-down testing; bottom-up testing; complete processor testing; unit testing]

How it works: Remove a block from the design. Test it in isolation against specification.

Requires writing a bug-free “contract model” for the unit.
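As an illustration of testing one block in isolation, the sketch below checks a bit-level ripple-carry adder against a one-line arithmetic specification over every input pair. The adder, its 8-bit width, and the function names are assumptions chosen for the example; in a real flow the unit under test would be the Verilog block, and exhaustive testing is only feasible for small input spaces.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_adder(a, b, width=8):
    """Unit under test: bit-level ripple-carry adder; returns (sum, carry_out)."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def adder_spec(a, b, width=8):
    """Contract model for the unit: plain integer arithmetic."""
    full = a + b
    return full & ((1 << width) - 1), full >> width


def unit_test(dut, spec, width=8):
    """Exhaustively compare dut and spec; return the failing (a, b) pairs."""
    n = 1 << width
    return [(a, b) for a in range(n) for b in range(n)
            if dut(a, b, width) != spec(a, b, width)]
```

The specification is deliberately much simpler than the unit, which makes it plausible to get it right by inspection.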


Climbing the Hierarchy: Multi-unit Testing

[Diagram labels: top-down testing; bottom-up testing; complete processor testing; multi-unit testing; unit testing]

How it works: Remove connected blocks from design. Test in isolation against specification.

Choice of partition determines if the test moves the project forward.

11

Page 12: CS 250 VLSI System Design Lecture 10 Design Verificationcs250/fa10/lectures/lec10.pdfsoftware simulator or FPGA hardware. Catch synthesis bugs. Formally verify netlist against Verilog

UC Regents Fall 2010 © UCBCS 250 L10: Design Verification


Processor Testing with Self-Checking Units

[Figure: testing hierarchy spanning unit testing, multi-unit testing, processor testing with self-checks, and complete processor testing, with top-down and bottom-up testing directions]

How it works: Add self-checking to units. Perform complete processor testing.

Self-checks are unit tests built into the CPU that generate the “right answer” on the fly.

Slower to simulate.
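A minimal Python sketch of the self-checking idea, assuming a hypothetical 32-bit adder unit. The class name, the `inject_bug` switch, and the ripple-carry model are illustrative and not from the lecture. The unit recomputes the “right answer” independently on every call, which is also why simulation gets slower.

```python
class SelfCheckingAdder:
    """Hypothetical 32-bit adder unit with a built-in self-check."""

    MASK = 0xFFFFFFFF

    def __init__(self, inject_bug=False):
        self.inject_bug = inject_bug  # illustration only: models a design bug
        self.checks_run = 0

    def _datapath(self, a, b, cin):
        # Bit-serial ripple-carry model standing in for the real design.
        total, carry = 0, cin
        for i in range(32):
            x, y = (a >> i) & 1, (b >> i) & 1
            total |= (x ^ y ^ carry) << i
            carry = (x & y) | (carry & (x ^ y))
            if self.inject_bug and i == 15:
                carry = 0  # injected fault: drops the carry out of bit 15
        return total, carry

    def add(self, a, b, cin=0):
        result = self._datapath(a, b, cin)
        # Self-check: compute the expected answer on the fly, independently.
        full = a + b + cin
        expected = (full & self.MASK, full >> 32)
        self.checks_run += 1
        assert result == expected, f"self-check failed: a={a:#x} b={b:#x} cin={cin}"
        return result
```

With the check inside the unit, a failure during a full-processor run points at the unit and the operands involved rather than at a wrong program result many instructions later.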

Testing: Verification vs. Diagnostics

[Figure: testing hierarchy spanning unit testing, multi-unit testing, processor testing with self-checks, and complete processor testing, with top-down and bottom-up testing directions]

Verification: A yes/no answer to the question “Does the processor have one more bug?”

Diagnostics: Clues to help find and fix the bug.

Diagnosis of bugs found during “complete processor” testing is hard ...


“CPU program” diagnosis is tricky ...

Observation: On a buggy CPU model, the correctness of every executed instruction is suspect.

Consequence: One needs to verify the correctness of instructions that surround the suspected buggy instruction.

Depends on (1) number of “instructions in flight” in the machine, and (2) lifetime of non-architected state (may be “indefinite”).


State observability and controllability

[Figure: testing hierarchy spanning unit testing, multi-unit testing, processor testing with self-checks, and complete processor testing, with top-down and bottom-up testing directions]

Observability: Does my model expose the state I need to diagnose the bug?

Controllability: Does my model support changing the state value I need to change to diagnose the bug?

Support != “yes, just rewrite the model code”!


Writing a Test Plan


The testing timeline ...

[Figure: testing hierarchy spanning unit testing, multi-unit testing, processor testing with self-checks, and complete processor testing, with top-down and bottom-up testing directions]

Time: Epoch 1, Epoch 2, Epoch 3, Epoch 4.

Milestones on the timeline, in order: processor assembly complete; correctly executes single instructions; correctly executes short programs.

Plan in advance what tests to do when ...

An example test plan ...

[Figure: timeline (Epoch 1 to Epoch 4) with milestones: processor assembly complete; correctly executes single instructions; correctly executes short programs. The testing hierarchy diagram is shown alongside.]

Epoch 1: unit testing early; multi-unit testing later.

Epoch 2: processor testing with self-checks; multi-unit testing and unit testing for diagnostics.

Epoch 3: complete processor testing for verification; processor testing with self-checks for diagnostics.

Epoch 4: complete processor testing; processor testing with self-checks, multi-unit testing, and unit testing for diagnostics.


Unit Testing


Combinational Unit Testing: 3-bit Adder

[Figure: adder with 3-bit inputs A and B, carry input Cin, 3-bit output Sum, and carry output Cout]

Number of input bits? 7

Total number of possible input values? 2^7 = 128

Just test them all ... Apply “test vectors” 0, 1, 2 ... 127 to inputs.

100% input space “coverage”: “exhaustive testing”.
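Exhaustive testing of the 3-bit adder can be sketched in Python as follows. The function names and the way the 7 input bits are packed into a vector number are assumptions made for illustration; the lecture only says to apply vectors 0 through 127.

```python
def adder3(a, b, cin):
    """Hypothetical 3-bit adder under test: returns (sum, cout)."""
    total = a + b + cin
    return total & 0b111, total >> 3

def reference(a, b, cin):
    """Independent bit-by-bit reference model built from full-adder equations."""
    s, carry = 0, cin
    for i in range(3):
        x, y = (a >> i) & 1, (b >> i) & 1
        s |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return s, carry

def exhaustive_test(dut):
    """Apply test vectors 0..127 and return the vectors that fail."""
    failures = []
    for vector in range(2 ** 7):
        a = vector & 0b111            # vector bits 0-2 (assumed packing)
        b = (vector >> 3) & 0b111     # vector bits 3-5
        cin = (vector >> 6) & 1       # vector bit 6
        if dut(a, b, cin) != reference(a, b, cin):
            failures.append(vector)
    return failures
```

An empty list from `exhaustive_test(adder3)` means every one of the 128 input combinations matched the reference.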

Combinational Unit Testing: 32-bit Adder

[Figure: adder with 32-bit inputs A and B, carry input Cin, 32-bit output Sum, and carry output Cout]

Number of input bits? 65

Total number of possible input values? 2^65 = 3.689e+19

Just test them all? Exhaustive testing does not “scale”.

“Combinatorial explosion!”
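To make the scale concrete, the arithmetic can be checked in a few lines of Python. The simulation rate of one billion vectors per second is an assumed figure chosen for illustration, not a number from the lecture.

```python
INPUT_BITS = 65                      # 32-bit A + 32-bit B + 1-bit Cin
VECTORS = 2 ** INPUT_BITS
VECTORS_PER_SECOND = 1e9             # assumed simulator speed (illustrative)
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_to_exhaust(vectors=VECTORS, rate=VECTORS_PER_SECOND):
    """Wall-clock years needed to apply every input vector once."""
    return vectors / rate / SECONDS_PER_YEAR

print(f"{VECTORS:.3e} vectors, about {years_to_exhaust():.0f} years")
```

Even at that optimistic rate the run would take more than a thousand years, which is why the following approaches sample the input space instead.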

Test Approach 1: Random Vectors

[Figure: adder with 32-bit inputs A and B, carry input Cin, 32-bit output Sum, and carry output Cout]

How it works: Apply random A, B, Cin to adder. Check Sum, Cout.

How? Use $random to set inputs to the testbench.

[Figure: bug curve plotting Bug Rate (bugs found per minute of testing) against Time]

When to stop testing?
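A rough Python analogue of a random-vector testbench is sketched below. Python's `random.Random` stands in for Verilog's `$random`, and `buggy_add` is a hypothetical faulty design invented here so that the test has something to find.

```python
import random

MASK32 = 0xFFFFFFFF

def reference_add(a, b, cin):
    """Golden model: plain integer addition split into (sum, cout)."""
    total = a + b + cin
    return total & MASK32, total >> 32

def buggy_add(a, b, cin):
    """Hypothetical faulty adder: loses the carry between its two 16-bit halves."""
    low = (a & 0xFFFF) + (b & 0xFFFF) + cin
    high = (a >> 16) + (b >> 16)          # bug: carry out of the low half is dropped
    return ((high << 16) | (low & 0xFFFF)) & MASK32, (high >> 16) & 1

def random_test(dut, n_vectors, seed=0):
    """Apply n_vectors random (A, B, Cin) triples and count mismatches."""
    rng = random.Random(seed)             # fixed seed so a failing run can be replayed
    mismatches = 0
    for _ in range(n_vectors):
        a, b, cin = rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(1)
        if dut(a, b, cin) != reference_add(a, b, cin):
            mismatches += 1
    return mismatches
```

Logging mismatches per batch of vectors over time gives the raw data for a bug curve like the one on the slide.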

Test Approach 2: Directed Vectors

[Figure: adder with 32-bit inputs A and B, carry input Cin, 32-bit output Sum, and carry output Cout]

How it works: Hand-craft test vectors to cover “corner cases”, for example A == B == Cin == 0.

“Black-box”: Corner cases based on functional properties.

“Clear-box”: Corner cases based on unit internal structure.

Power Tool: Directed Random
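The sketch below shows one way directed and directed-random vectors might look in Python. Apart from the all-zero case named on the slide, the specific corner cases and the biasing rule in `directed_random` are assumptions chosen for illustration.

```python
import random

MASK32 = 0xFFFFFFFF

# Hand-crafted corner cases.
DIRECTED = [
    (0, 0, 0),                    # A == B == Cin == 0 (the slide's example)
    (MASK32, 0, 1),               # black-box: carry ripples through all 32 bits
    (MASK32, MASK32, 1),          # black-box: largest possible result
    (0x0000FFFF, 0x00000001, 0),  # clear-box guess: carry crossing a 16-bit boundary
]

def directed_random(n, seed=0):
    """Directed random: random operands biased toward long carry chains."""
    rng = random.Random(seed)
    for _ in range(n):
        a = rng.getrandbits(32)
        # Start from ~A (so A + B is all ones) and flip one random bit of B.
        b = (~a & MASK32) ^ (1 << rng.randrange(32))
        yield a, b, rng.getrandbits(1)

def run(dut, vectors):
    """Return the vectors on which dut disagrees with integer addition."""
    bad = []
    for a, b, cin in vectors:
        total = a + b + cin
        if dut(a, b, cin) != (total & MASK32, total >> 32):
            bad.append((a, b, cin))
    return bad
```

Directed random keeps the volume of random testing while steering the generator toward the regions a designer suspects are fragile.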

State Machine Testing

CPU design examples: DRAM controller state machines, cache control state machines, branch prediction state machines.

Testing State Machines: Break Feedback

[Figure: state machine with a “Next State Combinational Logic” block, inputs Change and Rst, and three D flip-flops holding the outputs R, Y, and G]

Isolate “Next State” logic. Test as a combinational unit.

Easier with certain Verilog coding styles ...
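A Python sketch of breaking the feedback loop follows. The next-state logic is written as a pure function so it can be tested like any combinational unit. The R to G to Y to R ordering and the reset-to-R behavior are assumptions; the extracted slide text does not spell out the arc directions.

```python
R, Y, G = (1, 0, 0), (0, 1, 0), (0, 0, 1)   # (R, Y, G) one-hot state encodings

def next_state(state, change, rst):
    """Pure next-state function: no stored state, so it is testable in isolation.
    The R -> G -> Y -> R order is assumed for illustration."""
    if rst:
        return R
    if not change:
        return state
    return {R: G, G: Y, Y: R}[state]

class TrafficLight:
    """The state register wraps the combinational function; feedback lives only here."""
    def __init__(self):
        self.state = R

    def clock(self, change, rst):
        self.state = next_state(self.state, change, rst)
        return self.state

def exhaustive_next_state_check():
    """With feedback broken, every (state, Change, Rst) input can be applied directly."""
    results = {}
    for state in (R, Y, G):
        for change in (0, 1):
            for rst in (0, 1):
                results[(state, change, rst)] = next_state(state, change, rst)
    return results
```

Keeping the next-state logic in its own function mirrors the Verilog style of separating the combinational block from the register block.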

Testing State Machines: Arc Coverage

[Figure: state diagram with three states, (R Y G) = (1 0 0), (0 0 1), and (0 1 0), three arcs labeled Change == 1, and a reset arc labeled Rst == 1]

Force machine into each state. Test behavior of each arc.

Intractable for state machines with high edge density ...
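Arc coverage can be sketched in Python as below. The `force` method is a hypothetical controllability hook, and the transition table `SPEC` assumes the R to G to Y to R ordering, which the extracted slide does not state explicitly.

```python
R, Y, G = "R", "Y", "G"

# Assumed specification: (state, Change) -> next state; Rst is handled separately.
SPEC = {(R, 1): G, (G, 1): Y, (Y, 1): R, (R, 0): R, (G, 0): G, (Y, 0): Y}

class Machine:
    """Hypothetical design under test, written independently of SPEC."""
    def __init__(self):
        self.state = R

    def force(self, state):
        """Controllability hook: set the state register directly."""
        self.state = state

    def step(self, change, rst=0):
        if rst:
            self.state = R
        elif change:
            self.state = {R: G, G: Y, Y: R}[self.state]
        return self.state

def arc_coverage(machine):
    """Force the machine into each state, exercise each arc, record coverage and failures."""
    covered, failures = set(), []
    for (state, change), expected in SPEC.items():
        machine.force(state)
        if machine.step(change) != expected:
            failures.append((state, change))
        covered.add((state, change))
    for state in (R, Y, G):            # the reset arc leaves every state
        machine.force(state)
        if machine.step(0, rst=1) != R:
            failures.append((state, "rst"))
        covered.add((state, "rst"))
    return covered, failures
```

The number of arcs grows with the number of states times the number of input combinations, which is the edge-density problem the slide warns about.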


Regression Testing

Or, how to find the last bug ...


Writing “complete CPU” test programs

[Figure: test-plan timeline (Epoch 1 to Epoch 4, with milestones: processor assembly complete; correctly executes single instructions; correctly executes short programs) showing processor testing with self-checks and complete processor testing, alongside the testing hierarchy diagram]

Single instructions with directed-random field values.

White-box “Instructions-in-flight” sized programs that stress design.

Tests that stress long-lived non-architected state.

Regression testing: re-run subsets of the test library, and then the entire library, after a fix.

2010-10-11 John Wawrzynek and Krste Asanovic

with John Lazzaro

CS 250 VLSI System Design

Lecture 10 – Pipeline Micro-architecture

www-inst.eecs.berkeley.edu/~cs250/

TA: Yunsup Lee


Pipelining Basics


Starting Point: Performance Equation

\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

Instructions/Program. Rationale: Every additional instruction you execute takes time.

Cycles/Instruction (CPI): The average number of clock Cycles Per Instruction for the program. Different programs have different CPIs, for a variety of reasons.

Seconds/Cycle. Rationale: By shortening the period for each cycle, we shorten execution time.

Consider machine with a data cache ...

\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

A program’s load instructions “stride” through every memory address.

The cache never “hits”, so every load goes to DRAM (100x slower than loads that go to cache).

Thus, the average number of cycles for load instructions is higher for this program.

Cycles/Instruction: Thus, the average number of cycles for all instructions is higher for this program.

Seconds/Program: Thus, program takes longer to run!
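The chain of reasoning above can be worked through numerically. In this Python sketch the 100x DRAM penalty comes from the slide, while the load fraction and the one-cycle cost for cache hits and non-load instructions are assumed values.

```python
def average_cpi(load_fraction, load_cycles, other_cycles=1.0):
    """Weighted average of cycles per instruction over loads and all other instructions."""
    return load_fraction * load_cycles + (1.0 - load_fraction) * other_cycles

def seconds_per_program(instructions, cpi, seconds_per_cycle):
    """Performance equation: Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cpi * seconds_per_cycle

# Assumed numbers for illustration, except the 100x DRAM penalty from the slide.
LOAD_FRACTION = 0.25
HIT_CYCLES = 1
MISS_CYCLES = 100 * HIT_CYCLES

cpi_hits = average_cpi(LOAD_FRACTION, HIT_CYCLES)      # every load hits the cache
cpi_misses = average_cpi(LOAD_FRACTION, MISS_CYCLES)   # every load goes to DRAM
```

Under these assumptions the instruction count and clock period are unchanged, so the run time scales directly with the ratio `cpi_misses / cpi_hits`.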

Starting Point: Single-cycle processor

[Figure: single-cycle datapath with PC and +0x4 incrementer, instruction memory (Instr Mem), register file (RegFile: rs1, rs2, ws, wd, WE, rd1, rd2), Ext, 32-bit ALU (op), data memory (Addr, Din, Dout, WE), and MemToReg mux]

\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

Cycles/Instruction: CPI == 1. This is good.

Seconds/Cycle: Slow. This is bad.

Challenge: Speed up clock while keeping CPI == 1

Observation: Logic idle most of cycle

[Figure: single-cycle datapath with PC and +0x4 incrementer, instruction memory, register file, Ext, 32-bit ALU, data memory, and MemToReg mux]

For most of cycle, ALU is either “waiting” for its inputs, or “holding” its output.

Ideal: a CPU architecture where each part is always “working”.

Inspiration: Automobile assembly line

Assembly line moves on a steady clock.

Each station does the same task on each car.

[Figure: assembly line in which a car body shell and a car chassis enter a merge station, followed by a bolting station, all paced by the clock]

Inspiration: Automobile assembly line

Simpler station tasks → more cars per hour.

Simple tasks take less time, clock is faster.

Inspiration: Automobile assembly line

Line speed limited by slowest task.

Most efficient if all tasks take same time to do.

Inspiration: Automobile assembly line

Simpler tasks, complex car → long line!

These lines go 24 x 7, and rarely shut down.

Key analogy: The instruction is the car

[Figure: five-stage pipeline. Pipeline Stage #1 is Instruction Fetch (PC, +0x4, Instr Mem). An instruction register (IR) feeds each later stage: the IR for Stage #2 controls hardware in stage 2, and likewise for Stage #3, Stage #4, and Stage #5.]

“Data-stationary control”

Example: Decode & Register Fetch stage

[Figure: Pipeline Stage #1 (Instr Fetch: PC, +0x4, Instr Mem, IR), Stage #2 (Decode & Reg Fetch: RegFile, Ext, pipeline registers IR, A, B, M), and Stage #3, with the sample program’s instructions flowing through the stages]

A sample program:

ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8

R’s chosen so that instructions are independent - like cars on the line.

Performance Equation and Pipelining

[Figure: pipelined datapath showing Instr Fetch, Decode & Reg Fetch, and Stage #3]

\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

Cycles/Instruction: CPI == 1. Once pipe is full, one instruction completes per cycle.

Seconds/Cycle: Clock period is shorter. Less work to do in each cycle.

To get shortest clock period, balance the work to do in each pipeline stage.
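The effect of balancing can be illustrated with a small Python calculation. All stage delays here are hypothetical numbers picked so that the total work is the same in each case; none of them come from the lecture.

```python
def clock_period(stage_delays_ns, register_overhead_ns=0.0):
    """The clock must fit the slowest stage plus any pipeline-register overhead."""
    return max(stage_delays_ns) + register_overhead_ns

def execution_time_ns(n_instructions, stage_delays_ns, register_overhead_ns=0.0):
    """Run time with CPI == 1 once full: fill cycles plus one cycle per instruction."""
    n_stages = len(stage_delays_ns)
    cycles = (n_stages - 1) + n_instructions
    return cycles * clock_period(stage_delays_ns, register_overhead_ns)

# Hypothetical stage delays in nanoseconds (IF, ID, EX, MEM, WB).
UNBALANCED = [2.0, 1.0, 3.0, 3.0, 1.0]
BALANCED = [2.0, 2.0, 2.0, 2.0, 2.0]
SINGLE_CYCLE = [sum(UNBALANCED)]     # all the work in one long cycle
```

With these numbers the unbalanced pipeline is limited to a 3 ns clock by its slowest stage, while the balanced one reaches 2 ns for the same total work.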

Hazards: An instruction is not a car ...

[Figure: pipelined datapath with Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3, with the sample program’s ADD and OR instructions in adjacent stages]

New sample program:

ADD R4,R3,R2
OR R5,R4,R2

R4 not written yet ... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!

An example of a “hazard” -- we must (1) detect and (2) resolve all hazards to make a CPU that matches ISA.
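The broken contract can be reproduced with a toy Python model. It assumes a pipeline with no hazard handling in which operands are read in stage 2 and results are written in stage 5; the function names, the initial register values, and the same-cycle write-before-read ordering are illustrative assumptions.

```python
import operator

OPS = {"ADD": operator.add, "OR": operator.or_, "SUB": operator.sub}

def run_sequential(program, regs):
    """ISA reference: each instruction finishes before the next one starts."""
    regs = dict(regs)
    for op, rd, rs1, rs2 in program:
        regs[rd] = OPS[op](regs[rs1], regs[rs2])
    return regs

def run_naive_pipeline(program, regs, read_stage=2, write_stage=5):
    """No hazard handling: instruction i (0-based) is in stage s during cycle i + s,
    reads operands in read_stage, and writes its result in write_stage."""
    regs = dict(regs)
    events = []
    for i in range(len(program)):
        events.append((i + read_stage, 1, i, "read"))
        events.append((i + write_stage, 0, i, "write"))  # writes sort first within a cycle
    operands = {}
    for _cycle, _prio, i, kind in sorted(events):
        op, rd, rs1, rs2 = program[i]
        if kind == "read":
            operands[i] = (regs[rs1], regs[rs2])
        else:
            regs[rd] = OPS[op](*operands[i])
    return regs

INITIAL = {f"R{n}": n for n in range(11)}   # arbitrary starting values
HAZARD = [("ADD", "R4", "R3", "R2"), ("OR", "R5", "R4", "R2")]
INDEPENDENT = [("ADD", "R4", "R3", "R2"), ("OR", "R7", "R6", "R5"),
               ("SUB", "R10", "R9", "R8")]
```

For `INDEPENDENT` the two models agree; for `HAZARD` the naive pipeline computes R5 from the stale R4, which is exactly the mismatch with the ISA that must be detected and resolved.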

Performance Equation and Hazards

[Figure: pipelined datapath showing Instr Fetch, Decode & Reg Fetch, and Stage #3]

\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

Cycles/Instruction: Some ways to cope with hazards make CPI > 1 (“stalling pipeline”).

Seconds/Cycle: Added logic to detect and resolve hazards increases clock period.

A (simplified) 5-stage pipelined CPU

[Figure: five-stage datapath.
Stage 1, “IF” Stage (Instr Fetch): PC, +0x4, Instr Mem.
Stage 2, “ID/RF” Stage (Decode & Reg Fetch): RegFile, Ext; pipeline registers IR, A, B, M.
Stage 3, “EX” Stage (Execution): 32-bit ALU (op); pipeline registers IR, Y, M.
Stage 4, “MEM” Stage (Memory): Data Memory (Addr, Din, Dout, WE), MemToReg; pipeline registers IR, R.
Stage 5, WB (WriteBack): Mux, Logic; control signals WE, MemToReg.]


Visualizing Pipelines


Pipeline Representation #1: Timeline

[Figure: pipeline with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB]

Sample Program

I1: ADD R4,R3,R2
I2: OR R7,R6,R5
I3: SUB R1,R9,R8
I4: XOR R3,R2,R1
I5: AND R6,R5,R4

Inst  t1   t2   t3   t4   t5   t6   t7   t8
I1:   IF   ID   EX   MEM  WB   -    -    -
I2:   -    IF   ID   EX   MEM  WB   -    -
I3:   -    -    IF   ID   EX   MEM  WB   -
I4:   -    -    -    IF   ID   EX   MEM  WB
I5:   -    -    -    -    IF   ID   EX   MEM
I6:   -    -    -    -    -    IF   ID   EX

Good for visualizing pipeline fills.

Pipeline is “full” from t5 onward.
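A timeline like this can be generated mechanically. The Python sketch below assumes an ideal five-stage pipeline with no stalls; the function names and the "-" marker for idle cells are choices made here for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timeline(n_instructions, n_cycles):
    """Rows are instructions, columns are cycles t1..tN. Instruction k (1-based)
    occupies stage s (0-based) during cycle k + s, assuming no stalls."""
    rows = []
    for k in range(1, n_instructions + 1):
        row = []
        for t in range(1, n_cycles + 1):
            s = t - k
            row.append(STAGES[s] if 0 <= s < len(STAGES) else "-")
        rows.append(row)
    return rows

def render(rows):
    """Format the timeline as fixed-width text."""
    header = "Inst " + " ".join(f"t{t:<3}" for t in range(1, len(rows[0]) + 1))
    lines = [header]
    for k, row in enumerate(rows, start=1):
        lines.append(f"I{k}:  " + " ".join(f"{cell:<4}" for cell in row))
    return "\n".join(lines)
```

Calling `print(render(timeline(6, 8)))` produces a table with the same diagonal shape as the slide's figure.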

Representation #2: Resource Usage

[Figure: pipeline with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB]

Sample Program

I1: ADD R4,R3,R2
I2: OR R7,R6,R5
I3: SUB R1,R9,R8
I4: XOR R3,R2,R1
I5: AND R6,R5,R4

Stage  t1   t2   t3   t4   t5   t6   t7   t8
IF:    I1   I2   I3   I4   I5   I6   I7   I8
ID:    -    I1   I2   I3   I4   I5   I6   I7
EX:    -    -    I1   I2   I3   I4   I5   I6
MEM:   -    -    -    I1   I2   I3   I4   I5
WB:    -    -    -    -    I1   I2   I3   I4

Good for visualizing pipeline stalls.

Pipeline is “full” from t5 onward.
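The resource-usage view is the same schedule indexed by stage instead of by instruction. This Python sketch again assumes an ideal pipeline with no stalls, and its names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def resource_usage(n_instructions, n_cycles):
    """Rows are stages, columns are cycles t1..tN. Stage s (0-based) holds
    instruction t - s during cycle t, assuming no stalls."""
    table = {}
    for s, stage in enumerate(STAGES):
        row = []
        for t in range(1, n_cycles + 1):
            k = t - s
            row.append(f"I{k}" if 1 <= k <= n_instructions else "-")
        table[stage] = row
    return table

def full_cycles(table):
    """1-based cycle numbers in which every stage is busy."""
    n_cycles = len(table["IF"])
    return [t + 1 for t in range(n_cycles)
            if all(table[stage][t] != "-" for stage in STAGES)]
```

Reading down a column shows which instruction each stage holds in that cycle, so an idle slot stands out immediately.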


Data and Control Hazards


Data Hazards: 3 Types (RAW, WAR, WAW)

Several pipeline stages read or write the same data location in an incompatible way.

Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction, but I2 executes “too early” and reads the wrong copy of the data.

Note “data value”, not “register”. Data hazards are possible for any architected state (such as main memory). In practice, main memory hazard avoidance is the job of the memory system.
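The three types can be told apart by comparing the registers two instructions read and write. The slide defines only RAW; the WAR (Write After Read) and WAW (Write After Write) rules in this Python sketch follow the standard textbook definitions, and the function name and tuple layout are assumptions.

```python
def classify_hazards(first, second):
    """first and second are (dest, src1, src2) register tuples in program order.
    Returns the set of dependence types the pair exhibits."""
    d1, *srcs1 = first
    d2, *srcs2 = second
    hazards = set()
    if d1 in srcs2:
        hazards.add("RAW")   # second reads what first writes
    if d2 in srcs1:
        hazards.add("WAR")   # second writes what first reads
    if d1 == d2:
        hazards.add("WAW")   # both write the same location
    return hazards
```

Whether a detected dependence becomes an actual hazard depends on the pipeline: it matters only if the two accesses could happen in the wrong order.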

Recall: RAW example

[Figure: pipelined datapath with Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3, with the sample program’s ADD and OR instructions in adjacent stages]

Sample program:

ADD R4,R3,R2
OR R5,R4,R2

R4 not written yet ... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!

This is what we mean when we say Read After Write (RAW) Hazard.

Control Hazards: A taken branch/jump

[Figure: pipeline with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB]

Sample Program (ISA w/o branch delay slot)

I1: BEQ R4,R3,25
I2: SUB R1,R9,R8
I3: AND R6,R5,R4

Inst  t1   t2   t3   t4   t5
I1:   IF   ID   EX   MEM  WB
I2:   -    IF   ID
I3:   -    -    IF

EX stage computes if branch is taken.

If branch is taken, these instructions (I2 and I3) MUST NOT complete!

Note: with branch delay slot, I2 MUST complete, I3 MUST NOT complete.


Hazard Resolution Tools


The Hazard Resolution Toolkit

Stall earlier instructions in pipeline.

Kill earlier instructions in pipeline.

Forward results computed in later pipeline stages to earlier stages.

Add new hardware or rearrange hardware design to eliminate hazard.

Make hardware handle concurrent requests to eliminate hazard.

Change ISA to eliminate hazard.

Resolving a RAW hazard by stalling

[Figure: pipelined datapath with Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3, with the sample program’s ADD and OR instructions in adjacent stages]

Sample program:

ADD R4,R3,R2
OR R5,R4,R2

Let ADD proceed to WB stage, so that R4 is written to regfile.

Keep executing OR instruction until R4 is ready. Until then, send NOPs to IR 2/3.

Freeze PC and IR until stall is over.

New datapath hardware:

(1) Mux into IR 2/3 to feed in NOP.

(2) Write enable on PC and IR 1/2.
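The cost of stalling can be estimated with a short Python function. It assumes the five-stage pipeline with register reads in stage 2 and writes in stage 5 and no forwarding. Whether a read in the same cycle as the write sees the new value is not stated in the lecture, so it is exposed as the `write_before_read` parameter.

```python
def stall_cycles(distance, read_stage=2, write_stage=5, write_before_read=True):
    """NOPs needed when a consumer issued `distance` instructions after its producer
    reads, in read_stage, a register the producer writes in write_stage.
    Stages are numbered 1..5 (IF, ID, EX, MEM, WB); no forwarding is modeled."""
    # Cycle offsets are measured from the producer's IF cycle (cycle 0).
    write_cycle = write_stage - 1
    earliest_read = write_cycle if write_before_read else write_cycle + 1
    unstalled_read = distance + (read_stage - 1)
    return max(0, earliest_read - unstalled_read)
```

For the ADD/OR pair (`distance=1`) this gives two NOPs if the register file writes before it reads within a cycle and three otherwise; each NOP adds one cycle, which is how stalling pushes CPI above 1.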


Resolving a RAW hazard by forwarding

[Figure: “IF” Stage (Instr Fetch), “ID/RF” Stage (Decode & Reg Fetch), and “EX” Stage (Execution) of the pipelined datapath, with the sample program’s ADD and OR instructions in adjacent stages]

Sample program:

ADD R4,R3,R2
OR R5,R4,R2

ALU computes R4 in the EX stage, so ... Just forward it back!

Unlike stalling, does not change CPI. May hurt cycle time.
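The forwarding decision amounts to a comparison plus a mux. The Python sketch below models a single bypass path from the EX stage only; the function names and signal names (`ex_dest`, `ex_writes`, `alu_out`) are hypothetical, and details such as a hard-wired zero register are left out.

```python
def forward_select(src_reg, ex_dest, ex_writes):
    """Pick the operand source for the instruction in ID/RF.
    Returns "EX" to take the ALU output of the instruction one stage ahead,
    or "RF" to use the value read from the register file."""
    if ex_writes and ex_dest == src_reg:
        return "EX"
    return "RF"

def operand(src_reg, regfile, ex_dest, ex_writes, alu_out):
    """Forwarding mux: the bypassed ALU result or the register-file value."""
    if forward_select(src_reg, ex_dest, ex_writes) == "EX":
        return alu_out
    return regfile[src_reg]
```

With ADD R4,R3,R2 in EX and OR R5,R4,R2 in ID/RF, the OR's R4 operand comes from the ALU output while its R2 operand still comes from the register file. The extra comparator and mux sit on the operand path, which is the possible cycle-time cost the slide mentions.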


Control Hazards: Fix with more hardware

[Figure: pipeline with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB]

Sample Program (ISA w/o branch delay slot)

I1: BEQ R4,R3,25
I2: SUB R1,R9,R8
I3: AND R6,R5,R4

Inst  t1   t2   t3   t4   t5
I1:   IF   ID   EX   MEM  WB
I2:   -    IF   ID
I3:   -    -    IF

EX stage computes if branch is taken.

If we add hardware, can we move it here (to an earlier stage)?

If branch is taken, these instructions MUST NOT complete!

Resolving control hazard with hardware

[Figure: pipelined datapath with Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3. An added equality comparator (==) sends its result to the branch control logic.]

Control Hazards: After more hardware

[Figure: pipeline with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB]

Sample Program (ISA w/o branch delay slot)

I1: BEQ R4,R3,25
I2: SUB R1,R9,R8
I3: AND R6,R5,R4

Inst  t1   t2   t3   t4   t5
I1:   IF   ID   EX   MEM  WB
I2:   -    IF

ID stage computes if branch is taken.

If branch is taken, this instruction (I2) MUST NOT complete!

If we change ISA, can we always let I2 complete (“branch delay slot”) and eliminate the control hazard?
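The benefit of resolving the branch earlier can be put in numbers. In this Python sketch the stage numbering follows the five-stage pipeline from the lecture, while the branch and taken fractions used in the usage note are assumed values, and every killed instruction is assumed to cost one cycle.

```python
def killed_per_taken_branch(resolve_stage):
    """Instructions fetched behind a taken branch before it resolves (no delay slot).
    Stages are numbered 1..5 (IF, ID, EX, MEM, WB)."""
    return resolve_stage - 1

def cpi_with_branches(branch_fraction, taken_fraction, resolve_stage, base_cpi=1.0):
    """Each taken branch costs one bubble per killed instruction."""
    penalty = killed_per_taken_branch(resolve_stage)
    return base_cpi + branch_fraction * taken_fraction * penalty
```

Resolving in EX (stage 3) kills two instructions per taken branch and resolving in ID (stage 2) kills one. With, say, 20% branches of which half are taken, the CPI drops from about 1.2 to about 1.1.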


Page 62: CS 250 VLSI System Design Lecture 10 Design Verificationcs250/fa10/lectures/lec10.pdfsoftware simulator or FPGA hardware. Catch synthesis bugs. Formally verify netlist against Verilog

UC Regents Fall 2010 © UCBCS 250 L10: Design Verification

Resolve control hazard by killing instr

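[Figure: three-stage pipeline datapath. Stage #1 (Instr Fetch): PC with +0x4 incrementer and Instr Mem. Stage #2 (Decode & Reg Fetch): IR, RegFile (rs1, rs2, ws, wd, WE, rd1, rd2), Ext. Stage #3: A, B, M, and IR pipeline registers. A J 200 instruction is shown in the pipeline.]

Sample program (no delay slot):

J 200
OR R5,R4,R2

Detect the J instruction, mux a NOP into IR 1/2.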

Compute new PC using hardware not shown ...

This hurts CPI.

One can do better.

Page 63

Hazard Diagnosis

Assume MIPS ISA in examples to follow ...

Page 64

Data Hazards: Read After Write

Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction, but I2 executes "too early" and reads the wrong copy of the data.

Classic solution: use forwarding heavily, fall back on stalling when forwarding won't work or slows down the critical path too much.
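
A minimal Python sketch of RAW-hazard diagnosis, not taken from the lecture. The `Instr` tuple, the `raw_hazards` helper, and the `window` parameter (how many following instructions could still read a stale value) are assumptions made for illustration. Register 0 is skipped because the examples assume the MIPS ISA, where R0 always reads as zero.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[int]      # register written, or None if nothing is written
    srcs: Tuple[int, ...]    # registers read


def raw_hazards(program: List[Instr], window: int = 2):
    """Return (producer_index, consumer_index, register) for each RAW
    dependence whose consumer follows the producer within `window`
    instructions. `window` is a hypothetical stand-in for pipeline depth."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == 0:
            continue  # nothing written, or a write to MIPS R0 (discarded)
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # a younger write supersedes this producer
    return hazards


example = [
    Instr("ADD", 4, (3, 2)),
    Instr("SUB", 5, (4, 1)),
    Instr("OR", 6, (4, 5)),
]
print(raw_hazards(example))
```

Each reported triple is a place where the pipeline must either forward the value or stall the consumer.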

Page 65

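Full bypass network ...

[Figure: ID (Decode), EX, MEM, WB datapath. ID holds IR, RegFile (rs1, rs2, ws, wd, WE, rd1, rd2), Ext, and a Mux/Logic block that selects operands for the A, B, M registers. EX holds the 32-bit ALU (op) writing Y and M. MEM holds Data Memory (Addr, Din, Dout, WE) and MemToReg writing R. WE and MemToReg come from the IR registers. Bypass paths, including one labeled "From WB", feed the Mux/Logic block.]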

Page 66

Common bug: Multiple forwards ...

[Figure: bypass-network datapath with ID (Decode), EX, MEM, WB stages: RegFile, Ext, Mux/Logic block, A/B/M registers, 32-bit ALU, Y and R registers, Data Memory, and a "From WB" bypass path.]

Instructions shown in the pipeline:

ADD R4,R3,R2
OR R2,R3,R1
AND R2,R2,R1

Which do we forward from?

Page 67

Common bug: Multiple forwards II ...

[Figure: bypass-network datapath with ID (Decode), EX, MEM, WB stages: RegFile, Ext, Mux/Logic block, A/B/M registers, 32-bit ALU, Y and R registers, Data Memory, and a "From WB" bypass path.]

Instructions shown in the pipeline:

ADD R4,R0,R2
OR R0,R3,R1
AND R0,R2,R1

Which do we forward from?
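
A hedged Python sketch of one common answer to both "multiple forwards" questions. It assumes the ADD is in ID with the other two instructions ahead of it in the pipe, and it assumes MIPS semantics for R0. The function name `select_forward` and the stage labels are hypothetical. The rules sketched are: the youngest matching producer wins, and R0 is never bypassed because it always reads as zero.

```python
def select_forward(src_reg, in_flight):
    """Pick the bypass source for `src_reg`.

    `in_flight` lists (stage_name, dest_reg) for the instructions ahead of
    ID, ordered youngest first (for example EX, then MEM, then WB).
    Returns the stage to forward from, or "regfile" if no bypass applies.
    """
    if src_reg == 0:
        return "regfile"      # MIPS R0 always reads as zero; never bypass it
    for stage, dest in in_flight:
        if dest == src_reg:
            return stage      # youngest matching producer wins
    return "regfile"


# ADD R4,R3,R2 behind OR R2,... (assumed in EX) and AND R2,... (assumed in MEM)
print(select_forward(2, [("EX", 2), ("MEM", 2)]))
# ADD R4,R0,R2 behind two instructions that name R0 as destination
print(select_forward(0, [("EX", 0), ("MEM", 0)]))
```

Forwarding from the older producer in the first case, or forwarding a "result" for R0 in the second, are the bugs the two slides point at.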

Page 68

LW and Hazards

No load “delay slot”

Page 69

Questions about LW and forwarding

[Figure: bypass-network datapath with ID (Decode), EX, MEM, WB stages: RegFile, Ext, Mux/Logic block, A/B/M registers, 32-bit ALU, Y and R registers, Data Memory, and a "From WB" bypass path.]

Instructions shown in the pipeline:

ADDIU R1 R1 24
LW R1 128(R29)
OR R3,R3,R2

Do we need to stall?

Page 70

Questions about LW and forwarding

[Figure: bypass-network datapath with ID (Decode), EX, MEM, WB stages: RegFile, Ext, Mux/Logic block, A/B/M registers, 32-bit ALU, Y and R registers, Data Memory, and a "From WB" bypass path.]

Instructions shown in the pipeline:

ADDIU R1 R1 24
LW R1 128(R29)
OR R1,R3,R1

Do we need to stall?
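
One way to phrase the stall question from the two LW slides, as a small Python sketch. It assumes the OR is in ID while the LW is one stage ahead in EX, and that load data only becomes available after the MEM stage, so a dependent instruction directly behind a load has to wait one cycle. The name `must_stall` and its arguments are hypothetical.

```python
def must_stall(id_srcs, ex_is_load, ex_dest):
    """True if the instruction in ID must wait a cycle for a load in EX.

    id_srcs    -- registers read by the instruction in ID
    ex_is_load -- whether the instruction in EX is a load (e.g. LW)
    ex_dest    -- destination register of the instruction in EX
    """
    # R0 is excluded: under the assumed MIPS ISA it always reads as zero.
    return ex_is_load and ex_dest != 0 and ex_dest in id_srcs


# OR R3,R3,R2 behind LW R1: does not read R1
print(must_stall((3, 2), True, 1))
# OR R1,R3,R1 behind LW R1: reads the loaded register
print(must_stall((3, 1), True, 1))
```

In the second case the bypass logic must also avoid forwarding the older ADDIU result for R1, since the LW is the younger writer.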

Page 71

Branches and Hazards

Single “delay slot”

Page 72

Recall: Control hazard and hardware

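[Figure: three-stage pipeline datapath. Stage #1 (Instr Fetch): PC with +0x4 incrementer and Instr Mem. Stage #2 (Decode & Reg Fetch): IR, RegFile (rs1, rs2, ws, wd, WE, rd1, rd2), Ext, and an == comparator that goes to the branch control logic. Stage #3: A, B, M, and IR pipeline registers.]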

Page 73

Recall: After more hardware, change ISA

[Figure: five-stage pipeline, IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with PC, +0x4 incrementer, Instr Mem, and IR pipeline registers, above a timing chart of instructions I1-I6 against time t1-t8.]

Sample Program (ISA w/o branch delay slot):

I1: BEQ R4,R3,25
I2: SUB R1,R9,R8
I3: AND R6,R5,R4

Timing chart:

I1: IF ID EX MEM WB
I2: IF (If branch is taken, this instruction MUST NOT complete!)

ID stage computes if branch is taken.

If we change the ISA, we can always let I2 complete ("branch delay slot") and eliminate the control hazard.

Page 74

Question about branch and forwards:

[Figure: bypass-network datapath with ID (Decode), EX, MEM, WB stages: RegFile, Ext, Mux/Logic block, A/B/M registers, 32-bit ALU, Y and R registers, Data Memory, plus an == comparator that goes to the branch control logic.]

Instructions shown in the pipeline:

BEQ R1 R3 label
OR R3,R3,R1

Will this work as shown?

Page 75

Lessons learned

Pipelining is hard

Write test code in advance

Study every instruction

Think about interactions ...

Page 76

Control Implementation

Page 77

Recall: What is single cycle control?

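[Figure: single-cycle datapath. InstrMem (32-bit Addr, Data); RegFile with 5-bit rs1, rs2, ws and 32-bit rd1, rd2, wd, plus WE; Ext; 32-bit ALU (op); Data Memory (Addr, 32-bit Din, 32-bit Dout, WE). Signals shown on the datapath: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, Equal.]

Signals shown at the control block: Equal, RegDest, RegWr, ExtOp, ALUsrc, MemWr, MemToReg, PCSrc.

Combinational Logic (Only Gates, No Flip Flops)

Just specify logic functions!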

Page 78

In pipelines, all IR registers are used

[Figure: the IR registers of the ID (Decode), EX, MEM, and WB stages all connect to the control block. Signals shown: Equal, RegDest, RegWr, ExtOp, MemToReg, PCSrc.]

Combinational Logic (Only Gates, No Flip Flops)

(add extra state outside!)

A "conceptual" design -- for shortest critical path, IR registers may hold decoded info, not the complete 32-bit instruction.

Page 79

Advanced Pipelining

Page 80

5 Stage Pipeline: A point of departure

CS 152 L10 Pipeline Intro (9) Fall 2004 © UC Regents

Graphically Representing MIPS Pipeline

Can help with answering questions like:

how many cycles does it take to execute this code?

what is the ALU doing during cycle 4?

is there a hazard, why does it occur, and how can it be fixed?

[Figure: pipeline stage icons IM, Reg, ALU, DM, Reg.]

\[
\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
\]

At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage.

Filling all delay slots (branch, load)

Perfect caching

Processor has no "multi-cycle" instructions (ex: multiply with an accumulate register)
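
As a worked example of the performance equation on this slide, here is a small Python sketch. The function name `execution_time` and all the numbers are hypothetical; they only show how the three factors multiply.

```python
def execution_time(instructions, cycles_per_instruction, seconds_per_cycle):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical program: one million instructions on a 500 MHz clock (2 ns period).
best_case = execution_time(1_000_000, 1.0, 2e-9)   # ideal 5-stage pipe, CPI = 1
with_stalls = execution_time(1_000_000, 1.4, 2e-9) # same clock, CPI raised by stalls
print(best_case, with_stalls)
```

The best case corresponds to the conditions listed above: every delay slot filled, perfect caching, and no multi-cycle instructions.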

Page 81

Superpipelining: Add more stages

\[
\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
\]

Goal: Reduce critical path by adding more pipeline stages.

Difficulties: Added penalties for load delays and branch misses.

Ultimate Limiter: As logic delay goes to 0, FF clk-to-Q and setup.
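
A rough Python sketch of the "ultimate limiter". It assumes the logic can be split perfectly evenly across stages, which real designs rarely achieve; the function name and the delay values are hypothetical.

```python
def clock_period(total_logic_delay, stages, t_clk_to_q, t_setup):
    """Clock period when `total_logic_delay` is divided evenly over `stages`.

    Every stage still pays the flip-flop overhead (clk-to-Q plus setup),
    so the period never drops below t_clk_to_q + t_setup.
    """
    return total_logic_delay / stages + t_clk_to_q + t_setup


# Hypothetical numbers in nanoseconds: 10 ns of logic, 0.2 ns clk-to-Q, 0.1 ns setup.
for n in (1, 5, 8, 100):
    print(n, clock_period(10.0, n, 0.2, 0.1))
```

Going from 5 to 8 stages shortens the period, but the flip-flop overhead takes a growing share of each cycle as the stage count rises.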

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Example: 8-stage ARM XScale: extra IF, ID, data cache stages.

Also, power!

Page 82

\[
\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
\]

Goal: Improve CPI by issuing several instructions per cycle.

Difficulties: Load and branch delays affect more instructions.

Ultimate Limiter: Programs may be a poor match to issue rules.

[Embedded slides from Krste, 6.823 L11, March 10, 2005]

Function Unit Characteristics

fully pipelined (busy, accept): 1cyc 1cyc 1cyc

partially pipelined (busy, accept): 2 cyc 2 cyc

Function units have internal pipeline registers

-- operands are latched when an instruction enters a function unit

-- inputs to a function unit (e.g., register file) can change during a long latency operation

Multiple Function Units

[Figure: IF, ID, Issue, GPRs, FPRs, function units ALU, Mem, Fadd, Fmul, Fdiv, and WB.]

Example: CPU with floating point ALUs: issue 1 FP + 1 integer instruction per cycle.

Superscalar: Multiple issues per cycle

Page 83

Throughput and multiple threads

Goal: Use multiple CPUs (real and virtual) to improve (1) throughput of machines that run many programs (2) execution time of multi-threaded programs.

Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.

Ultimate limiter: Amdahl’s law, memory system performance.

Example: Sun Niagara (8 SPARCs on one chip).
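
A short Python sketch of Amdahl's law, the first "ultimate limiter" named above. The function name and the parallel fractions are hypothetical; the 8 matches the CPU count of the Niagara example, but nothing here is a measurement of that chip.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Speedup when `parallel_fraction` of the run time scales across
    `n_cpus` and the remainder stays serial (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# With 8 CPUs, the serial part quickly dominates.
for p in (0.5, 0.9, 0.99):
    print(p, amdahl_speedup(p, 8))
```

Even at 90% parallel code, 8 CPUs give well under an 8x speedup, which is why rewriting applications, OS, and libraries matters.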

Page 84

Superpipelining

Page 85

[Figure: embedded slide "Graphically Representing MIPS Pipeline" (CS 152 L10 Pipeline Intro (9), Fall 2004 © UC Regents), showing IM, Reg, ALU, DM, Reg against the stages IF, ID+RF, EX, MEM, WB with IR pipeline registers. It can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? is there a hazard, why does it occur, and how can it be fixed?]

5 Stage

8 Stage

IF now takes 2 stages (pipelined I-cache)

ID and RF each get a stage.

ALU split over 3 stages

MEM takes 2 stages (pipelined D-cache)

Note: Some stages now overlap, some instructions take extra stages.

Page 86

Superpipelining techniques ...

Split ALU and decode logic over several pipeline stages.

Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.

Remove “rarely-used” forwarding networks that are on critical path.

Pipeline the wires of frequently used forwarding networks.

Creates stalls, affects CPI.

Also: Clocking tricks (example: negedge register file)

Page 87

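Add pipeline stages, reduce clock period

\[
\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
\]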

ARM XScale: 8 stages

Q. Could adding pipeline stages hurt the CPI for an application?

A. Yes, due to these problems:

CPI Problem                          | Possible Solution
Taken branches cause longer stalls   | Branch prediction, loop unrolling
Cache misses take more clock cycles  | Larger caches, add prefetch opcodes to ISA
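
The CPI problems listed on this slide can be put into a rough formula. The Python sketch below is an assumption-laden model, not the lecture's: it treats branch and cache penalties as independent additive terms, and every name and number in it is hypothetical.

```python
def effective_cpi(base_cpi, branch_frac, taken_frac, branch_penalty,
                  misses_per_instr, miss_penalty):
    """CPI = base + taken-branch stall cycles + cache-miss stall cycles,
    all averaged per instruction."""
    branch_stalls = branch_frac * taken_frac * branch_penalty
    miss_stalls = misses_per_instr * miss_penalty
    return base_cpi + branch_stalls + miss_stalls


# Hypothetical workload: 20% branches, 60% of them taken, 2% misses per instruction.
shallow = effective_cpi(1.0, 0.2, 0.6, 1, 0.02, 10)   # short pipe: small penalties
deep = effective_cpi(1.0, 0.2, 0.6, 4, 0.02, 20)      # deeper pipe: larger penalties in cycles
print(shallow, deep)
```

The deeper pipe only wins overall if its shorter clock period outweighs the CPI increase, which is what branch prediction and larger caches try to protect.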

Page 88

Recall: Control hazards ...

[Figure: five-stage pipeline, IF (Fetch) with I-Cache, ID (Decode), EX (ALU), MEM, WB, with PC, +0x4 incrementer, and IR pipeline registers, above a timing chart of instructions I1-I6 against time t1-t8.]

Sample Program (ISA w/o branch delay slot):

I1: BEQ R4,R3,25
I2: SUB R1,R9,R8
I3: AND R6,R5,R4

Timing chart:

I1: IF ID EX MEM WB
I2: IF ID
I3: IF

EX stage computes if branch is taken.

If branch is taken, these instructions MUST NOT complete!

We avoided stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage.

If we add more early stages, we must stall.

Page 89

Solution: Branch prediction ...

[Figure: five-stage pipeline, IF (Fetch) with I-Cache, ID (Decode), EX (ALU), MEM, WB, with PC, +0x4 incrementer, IR pipeline registers, and a Branch Predictor block, above a timing chart of instructions I1-I6 against time t1-t8 (I1: IF ID EX MEM WB; I2: IF ID; I3: IF).]

Branch Predictor predictions:

A control instr?

Taken or Not Taken?

If taken, where to? What PC?

EX stage computes if branch is taken.

If we predicted incorrectly, these instructions MUST NOT complete!

We update the PC based on the outputs of the branch predictor. If it is perfect, pipe stays full!

Dynamic Predictors: a cache of branch history
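
The slide only says that a dynamic predictor is "a cache of branch history". As one possible illustration, the Python sketch below uses a table of 2-bit saturating counters indexed by PC bits. That scheme is an assumption chosen here, not something the lecture specifies, and the class name, table size, and indexing are all hypothetical. It answers only the "Taken or Not Taken?" question, not the target PC.

```python
class TwoBitPredictor:
    """Direction predictor: one 2-bit saturating counter per table entry."""

    def __init__(self, entries=16):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries

    def _index(self, pc):
        # Drop the two byte-offset bits of a word-aligned PC, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = TwoBitPredictor()
for outcome in (True, True, False, True):   # hypothetical history of one branch
    print(predictor.predict(0x40), outcome)
    predictor.update(0x40, outcome)
```

The second counter bit means a single not-taken outcome, such as a loop exit, does not immediately flip a strongly-taken prediction.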

Page 90

Superscalar

Basic Idea: Improve CPI by issuing several instructions per cycle.

Page 91

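Superscalar R machine

[Figure: two parallel five-stage pipelines, each labeled IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR pipeline registers, A and B operand registers, a 32-bit ALU (op), and Y and R result registers. A shared RegFile has four read ports (rs1-rs4, rd1-rd4) and two write ports (ws1, wd1, WE1 and ws2, wd2, WE2). Instr Mem takes a 32-bit Addr from the PC and Sequencer and returns 64 bits of Data to the Instruction Issue Logic.]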

Page 92

Sustaining Dual Instr Issues (no forwarding)

[Figure: dual-issue datapath with one IF (Fetch) stage feeding two ID (Decode), EX (ALU), MEM, WB pipes; shared RegFile with four read ports (rs1-rs4, rd1-rd4) and two write ports (ws1, wd1, WE1 and ws2, wd2, WE2); A and B operand registers, 32-bit ALU, Y and R registers in each pipe.]

Instruction pairs shown in the pipeline:

ADD R8,R0,R0 and ADD R11,R0,R0
ADD R9,R8,R7 and ADD R12,R11,R10
ADD R15,R14,R13 and ADD R18,R17,R16
ADD R21,R20,R19 and ADD R24,R23,R22
ADD R27,R26,R25 and ADD R30,R29,R28

It's rarely this good ...

Page 93

Worst-Case Instruction Issue

[Figure: dual-issue datapath with one IF (Fetch) stage feeding two ID (Decode), EX (ALU), MEM, WB pipes; shared RegFile with four read ports (rs1-rs4, rd1-rd4) and two write ports (ws1, wd1, WE1 and ws2, wd2, WE2); A and B operand registers, 32-bit ALU, Y and R registers in each pipe.]

Instructions shown in the pipeline, with NOPs in the remaining slots:

ADD R8,R0,R0
ADD R9,R8,R0
ADD R10,R9,R0
ADD R11,R10,R0

Dependencies force "serialization"

We add 12 forwarding buses (not shown). (6 to each ID from stages of both pipes).
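
A simplified Python sketch of the pairing decision behind the best-case and worst-case issue slides. It models an in-order, 2-wide machine that issues two adjacent instructions together only when the second neither reads nor rewrites the first one's destination, and it assumes forwarding covers dependences between different cycles. The function name and the tuple encoding are hypothetical.

```python
def dual_issue_cycles(program):
    """Count issue cycles for a list of (dest_reg, src_regs) tuples."""
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program):
            dest, _ = program[i]
            next_dest, next_srcs = program[i + 1]
            # A write to R0 is discarded under the assumed MIPS ISA,
            # so it creates no dependence.
            independent = dest == 0 or (dest not in next_srcs and dest != next_dest)
            if independent:
                i += 2      # both instructions issue this cycle
                continue
        i += 1              # only one instruction issues; other slot gets a NOP
    return cycles


independent_adds = [(8, (0, 0)), (11, (0, 0)), (15, (14, 13)), (18, (17, 16))]
dependent_chain = [(8, (0, 0)), (9, (8, 0)), (10, (9, 0)), (11, (10, 0))]
print(dual_issue_cycles(independent_adds), dual_issue_cycles(dependent_chain))
```

The dependent chain issues one instruction per cycle, which is the "serialization" the slide describes.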

Page 94

Multi-Threading

Page 95

Recall: Bypass network prevents stalls

[Figure: bypass-network datapath with ID (Decode), EX, MEM, WB stages: RegFile, Ext, Mux/Logic block, A/B/M registers, 32-bit ALU, Y and R registers, Data Memory, and a "From WB" bypass path.]

Instead of bypass: Interleave threads on the pipeline to prevent stalls ...

Page 96

Krste, November 10, 2004, 6.823, L18--3

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

-- One way is to interleave execution of instructions from different program threads on same pipeline

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                      t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW r1, 0(r2)      F  D  X  M  W
T2: ADD r7, r1, r4       F  D  X  M  W
T3: XORI r5, r4, #12        F  D  X  M  W
T4: SW 0(r7), r5               F  D  X  M  W
T1: LW r5, 12(r1)                 F  D  X  M  W

Last instruction in a thread always completes writeback before next instruction in same thread reads regfile
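
The writeback-before-read property of the interleaved pipe can be checked with a little arithmetic. The Python sketch below assumes strict round-robin fetch, one stage per cycle, stages numbered from 0 (F), and that a register written in one cycle is readable only in a later cycle. The function name and parameters are hypothetical.

```python
def interleave_is_hazard_free(num_threads, writeback_stage=4, regread_stage=1):
    """True if a thread's instruction finishes W before that thread's next
    instruction reads the register file in D.

    An instruction fetched at cycle t is in stage s at cycle t + s, and the
    same thread fetches again num_threads cycles later.
    """
    writeback_cycle = writeback_stage               # first instruction fetched at cycle 0
    next_read_cycle = num_threads + regread_stage   # next instruction of the same thread
    return writeback_cycle < next_read_cycle


for n in (2, 3, 4):
    print(n, interleave_is_hazard_free(n))
```

With 4 threads the first T1 instruction writes back in t4 and the next T1 instruction reads registers in t5; under these assumptions, fewer threads would need stalls or a bypass network.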

Krste, November 10, 2004, 6.823, L18--5

Simple Multithreaded Pipeline

[Figure: pipeline with a +1 incrementer, a 2-bit thread select, four PC registers, I$, IR, four GPR files, X, Y, and D$.]

Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage

Introduced in 1964 by Seymour Cray

4 CPUs, each run at 1/4 clock

Many variants ...

Page 97

Upcoming: Project Proposals
