CS 152 Computer Architecture and Engineering
Lecture 5 – Timing

2006-9-12
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: Udam Saini and Jue Sun

www-inst.eecs.berkeley.edu/~cs152/

CS 152 L5: Timing · UC Regents Fall 2006 © UCB

Last Time: Making a Test Plan ...

Which testing types are good for each epoch?

[Figure: a timeline ("Time") divided into Epoch 1 through Epoch 4 by three milestones: "processor assembly complete", "correctly executes single instructions", and "correctly executes short programs". Testing types placed on the timeline: unit testing (early), multi-unit testing (later), processor testing with self-checks, diagnostics, complete processor testing, and verification. Legend entries: top-down testing, bottom-up testing.]

Last Time: Works in ModelSim, but ...

Idea: get confidence in “going to board” earlier ...

[Figure: the four-epoch timeline again (complete processor testing, unit testing, multi-unit testing, processor testing with self-checks; top-down and bottom-up testing), annotated with tool-usage splits. One label pair reads ModelSim 80 % / Xilinx 20 %; the other three read ModelSim 20 % / Xilinx 80 %.]

Also: catch Synplicity “warnings and errors” earlier: “latch generated”, “combinational loop detected”, etc.

Teamwork lessons from previous semesters ...

Human Strategies
▶ Group design effort
  § Everyone is clear on the specs
▶ Modular work effort
  § Divide the work between all your teammates
  § Avoid having 4 people working on 1 screen
▶ But help each other test
  § Fresh eyes catch different bugs

Today: Determine minimum clock period

[Figure: single-cycle datapath. Labeled blocks: PC register (D, Q, Clk), Instr Mem (Addr, Data), two 32-bit adders (one with the constant 0x4), RegFile (5-bit rs1, rs2, ws; 32-bit rd1, rd2, wd; WE), Extend, ALU (32-bit operands, op), Equal, and Data Memory (Addr, Din, Dout; WE). Control lines from combinational logic: RegDest, ExtOp, ALUsrc, ALUctr, MemToReg, MemWr, RegWr, PCSrc.]

Step 1a: The MIPS-lite Subset for today
(inset slide: CS 152 L06 Single Cycle 1 (6), UC Regents Fall 2004 © UCB)

° ADD and SUB
  • addU rd, rs, rt
  • subU rd, rs, rt

  Field:   op      rs      rt      rd      shamt   funct
  Bits:    31-26   25-21   20-16   15-11   10-6    5-0
  Width:   6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

° OR Immediate:
  • ori rt, rs, imm16

° LOAD and STORE Word
  • lw rt, rs, imm16
  • sw rt, rs, imm16

° BRANCH:
  • beq rs, rt, imm16

  Format shared by ori, lw, sw, and beq:

  Field:   op      rs      rt      immediate
  Bits:    31-26   25-21   20-16   15-0
  Width:   6 bits  5 bits  5 bits  16 bits
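The two field layouts above can be checked with a few shifts and masks. The sketch below is illustrative only: the function names `decode_r_type` and `decode_i_type` are hypothetical, and the only facts taken from the slide are the bit boundaries (31, 26, 21, 16, 11, 6, 0) and the field widths.

```python
def decode_r_type(word):
    """Split a 32-bit word using the op/rs/rt/rd/shamt/funct layout."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26, 6 bits
        "rs":    (word >> 21) & 0x1F,  # bits 25-21, 5 bits
        "rt":    (word >> 16) & 0x1F,  # bits 20-16, 5 bits
        "rd":    (word >> 11) & 0x1F,  # bits 15-11, 5 bits
        "shamt": (word >> 6) & 0x1F,   # bits 10-6,  5 bits
        "funct": word & 0x3F,          # bits 5-0,   6 bits
    }


def decode_i_type(word):
    """Split a 32-bit word using the op/rs/rt/immediate layout."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26, 6 bits
        "rs":    (word >> 21) & 0x1F,  # bits 25-21, 5 bits
        "rt":    (word >> 16) & 0x1F,  # bits 20-16, 5 bits
        "imm16": word & 0xFFFF,        # bits 15-0, 16 bits
    }


# Hypothetical test words, built field by field rather than taken from the slide.
r_word = (1 << 21) | (2 << 16) | (3 << 11) | 0x21
i_word = (0x23 << 26) | (4 << 21) | (5 << 16) | 0x0010
r_fields = decode_r_type(r_word)
i_fields = decode_i_type(i_word)
```

The field widths sum to 32 in both layouts (6+5+5+5+5+6 and 6+5+5+16), which is a quick sanity check on the reconstructed tables.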


Today’s Lecture: Timing Analysis

Xilinx and delay

Clocked logic and delay

Combinational logic delay


View from 10,000 Feet

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

Fig. 2. Microprocessor pipeline organization.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the […] versus […] dependence and source-to-body bias is used to electrically limit transistor […] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Architects draw blocks ... Circuit designers draw transistors. Logic is where they meet.

[Inset slide: Spring 2003 EECS150 – Lec10-Timing]

Qualitative Analysis of Logic Delay
• Improved Transistor Model:
  • We refer to transistor “strength” as the amount of current that flows for a given Vds and Vgs.
  • The strength is linearly proportional to the ratio of W/L.

Architects reach logic top-down ... using Verilog and schematics.

[Figure: a "Next State Combinational Logic" block with inputs R, Y, G, Change, and Rst and outputs next_R, next_Y, next_G.]

wire next_R, next_Y, next_G;

assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
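To see what the three assign statements compute, it can help to step them in software. The sketch below is a Python mirror of the Verilog, not part of the lecture; the function name `next_state` and the driver loop are hypothetical.

```python
def next_state(r, y, g, change, rst):
    """Return (next_R, next_Y, next_G) for one-hot state bits R, Y, G."""
    if rst:
        return (1, 0, 0)   # reset forces the R state
    if change:
        return (y, g, r)   # next_R = Y, next_Y = G, next_G = R
    return (r, y, g)       # otherwise hold the current state


# Reset once, then assert `change` three times and record each state.
state = next_state(0, 0, 0, change=0, rst=1)
sequence = [state]
for _ in range(3):
    state = next_state(*state, change=1, rst=0)
    sequence.append(state)
```

Under these equations the one-hot state rotates R, then G, then Y, then back to R, because each next-state bit simply copies its predecessor when `change` is high.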

EEs reach logic bottom-up ...

Can you build a processor entirely out of NAND gates?

Small number of high-performance logic circuits. For some definition of “small” and “high-performance”.

Basic Components: CMOS Logic Gates (CS152 / Kubiatowicz, Spring 2004, Lec3.33)

NAND Gate: Out = NOT (A • B)

  A  B  Out
  0  0  1
  0  1  1
  1  0  1
  1  1  0

NOR Gate: Out = NOT (A + B)

  A  B  Out
  0  0  1
  0  1  0
  1  0  0
  1  1  0

[Figure: gate symbols and transistor-level circuits for both gates, with inputs A and B, output Out, and supply Vdd.]

Basic Components: CMOS Logic Gates (CS152 / Kubiatowicz, Spring 2004, Lec3.34)

4-input NAND Gate
[Figure: gate symbol and transistor-level circuit with inputs A, B, C, D, output Out, and supply Vdd.]
More inputs: more asymmetric edge times!

Ideal versus Reality (CS152 / Kubiatowicz, Spring 2004, Lec3.35)

° When input 0 -> 1, output 1 -> 0 but NOT instantly
  • Output goes 1 -> 0: output voltage goes from Vdd (5v) to 0v
° When input 1 -> 0, output 0 -> 1 but NOT instantly
  • Output goes 0 -> 1: output voltage goes from 0v to Vdd (5v)
° Voltage does not like to change instantaneously

[Figure: inverter (In, Out) and a plot of voltage versus time showing Vin and Vout between 0 => GND and 1 => Vdd.]

Fluid Timing Model (CS152 / Kubiatowicz, Spring 2004, Lec3.36)

° Water ↔ Electrical Charge; Tank Capacity ↔ Capacitance (C)
° Water Level ↔ Voltage; Water Flow ↔ Charge Flowing (Current)
° Size of Pipes ↔ Strength of Transistors (G)
° Time to fill up the tank proportional to C / G

[Figure: a reservoir at level Vdd connected through switch SW1 to a tank (Cout) whose level is Vout; switch SW2 drains the tank into a bottomless sea at sea level (GND).]
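The claim that fill time scales with C / G can be made concrete with a first-order model. The sketch below assumes an ideal RC node with R = 1/G charging toward Vdd, so V(t)/Vdd = 1 - exp(-t·G/C); that model and the function name `time_to_reach` are assumptions for illustration, not something the slide specifies.

```python
import math


def time_to_reach(fraction, c, g):
    """Time for an ideal RC node to charge to `fraction` of Vdd.

    Solves 1 - exp(-t * g / c) = fraction for t, with 0 <= fraction < 1.
    """
    return (c / g) * math.log(1.0 / (1.0 - fraction))


# Arbitrary units: compare a baseline with a bigger tank and a bigger pipe.
t_base = time_to_reach(0.5, c=1.0, g=1.0)
t_big_tank = time_to_reach(0.5, c=2.0, g=1.0)   # twice the capacitance
t_big_pipe = time_to_reach(0.5, c=1.0, g=2.0)   # twice the conductance
```

In this model doubling the tank doubles the time to reach the half-way level, and doubling the pipe halves it, which is the proportionality the slide states.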


Logic Synthesis often bridges the gap ...

Design Refinement (CS152 / Kubiatowicz, Spring 2004, Lec3.5)

  Informal System Requirement
  Initial Specification
  Intermediate Specification
  Final Architectural Description
  Intermediate Specification of Implementation
  Final Internal Specification
  Physical Implementation

  refinement: increasing level of detail

Logic Components (CS152 / Kubiatowicz, Spring 2004, Lec3.6)

Elements of the design zoo (CS152 / Kubiatowicz, Spring 2004, Lec3.7)

° Wires: Carry signals from one point to another
  • Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
  • Data goes in, Results come out after some propagation delay
° Flip-Flops: Storage Elements
  • After a clock edge, input copied to output
  • Otherwise, the flip-flop holds its value
  • Also: a “Latch” is a storage element that is level triggered

[Figure: a single-bit flip-flop (D, Q), an 8-bit register (D[8], Q[8]), and a Combinational Logic block with bus widths 8, 11, and 8 labeled.]

Basic Combinational Elements + DeMorgan Equivalence (CS152 / Kubiatowicz, Spring 2004, Lec3.8)

Wire: Out = In

  In  Out
  0   0
  1   1

Inverter: Out = NOT In

  In  Out
  0   1
  1   0

NAND Gate: Out = NOT (A • B) = (NOT A) + (NOT B)

  A  B  Out
  0  0  1
  0  1  1
  1  0  1
  1  1  0

NOR Gate: Out = NOT (A + B) = (NOT A) • (NOT B)

  A  B  Out
  0  0  1
  0  1  0
  1  0  0
  1  1  0

DeMorgan’s Theorem: the same tables result from an OR gate (for NAND) or an AND gate (for NOR) with both inputs inverted.

  A  B  NOT A  NOT B  (NOT A) + (NOT B)  (NOT A) • (NOT B)
  0  0  1      1      1                  1
  0  1  1      0      1                  0
  1  0  0      1      1                  0
  1  1  0      0      0                  0
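The DeMorgan equivalences on the Lec3.8 slide can be verified exhaustively, since two inputs give only four cases. The sketch below is a plain truth-table check; the helper names `nand`, `nor`, and `inv` are chosen here for illustration.

```python
from itertools import product


def inv(a):
    return 1 - a


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


# Check both identities for every combination of A and B.
demorgan_holds = all(
    nand(a, b) == (inv(a) | inv(b)) and nor(a, b) == (inv(a) & inv(b))
    for a, b in product((0, 1), repeat=2)
)

nand_table = [nand(a, b) for a, b in product((0, 1), repeat=2)]
nor_table = [nor(a, b) for a, b in product((0, 1), repeat=2)]
```

The two lists follow the input order 00, 01, 10, 11 used in the slide's truth tables.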

Still, in the highest performance designs, human designers do (some) logic, circuits, and layout by hand.


A Logic Circuit Primer

“Models should be as simple as possible, but no simpler ...” Albert Einstein.


Inverters: A simple transistor model

Delay Model: CMOS (CS152 / Kubiatowicz, Spring 2004, Lec3.29)

Review: General C/L Cell Delay Model (CS152 / Kubiatowicz, Spring 2004, Lec3.30)

° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth-table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes

[Figure: a Combinational Logic Cell with inputs A, B, ..., X driving a load Cout, and a plot of Delay (Va -> Vout) versus Cout: a straight line whose intercept is the internal delay and whose slope is the delay per unit load, with Ccritical marked.]

Basic Technology: CMOS (CS152 / Kubiatowicz, Spring 2004, Lec3.31)

° CMOS: Complementary Metal Oxide Semiconductor
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a “conductor”

[Figure: NMOS and PMOS transistor symbols, with Vdd = 5V and GND = 0v.]

Basic Components: CMOS Inverter (CS152 / Kubiatowicz, Spring 2004, Lec3.32)

[Figure: inverter symbol (In, Out) and circuit, with the PMOS transistor between Vdd and Out and the NMOS transistor between Out and ground.]

° Inverter Operation

[Figure: two cases. In one, one transistor is open while the other charges Vout to Vdd; in the other, one transistor is open while the other discharges Out.]

pFET. A switch. “On” if gate is grounded.

nFET. A switch. “On” if gate is at Vdd.

[Annotations on the inverter: input “1” gives output “0”; input “0” gives output “1”.]

This model is too simple to be useful ...
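The switch model above is easy to write down, which also shows what it leaves out. The sketch below treats each FET as an ideal switch; the function names are hypothetical, and the NAND wiring (pFETs in parallel to Vdd, nFETs in series to ground) is the standard static CMOS arrangement rather than something spelled out in the extracted slide text.

```python
def nfet_on(gate):
    """nFET: conducts when its gate is at Vdd (logic 1)."""
    return gate == 1


def pfet_on(gate):
    """pFET: conducts when its gate is grounded (logic 0)."""
    return gate == 0


def inverter(a):
    pull_up = pfet_on(a)      # pFET connects Out to Vdd
    pull_down = nfet_on(a)    # nFET connects Out to ground
    assert pull_up != pull_down, "exactly one network conducts"
    return 1 if pull_up else 0


def nand2(a, b):
    pull_up = pfet_on(a) or pfet_on(b)      # pFETs in parallel
    pull_down = nfet_on(a) and nfet_on(b)   # nFETs in series
    assert pull_up != pull_down, "exactly one network conducts"
    return 1 if pull_up else 0


inverter_table = [inverter(a) for a in (0, 1)]
nand_table = [nand2(a, b) for a in (0, 1) for b in (0, 1)]
```

Every output here changes instantly, so the model gets the logic right but says nothing about delay, which is the sense in which it is too simple for timing analysis.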

Transistors as water valves

If electrons are water molecules, and a capacitor a bucket ...

An “on” p-FET fills up the capacitor with charge.

An “on” n-FET empties the bucket.

[Plots: water level versus time between the “0” and “1” levels, one plot for each case.]

[Inset slide: Spring 2003 EECS150 – Lec10-Timing, Page 10]

Gate Switching Behavior
• Inverter:
• NAND gate:

This model is often good enough ...

What is the bucket? A gate’s “fan-out”.

“Fan-out”: The number of gate inputs driven by a gate’s output.

Driving other gates slows a gate down.

Driving wires slows a gate down.


Why we call it “fan-out”

A closer look at fan-out ...

[Inset slide: Spring 2003 EECS150 – Lec10-Timing, Page 12]

Gate Delay
• Fan-out:
• The delay of a gate is proportional to its output capacitance. Because of that, gates #2 and 3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and 3 as we add more output capacitance.)

[Figure: gate 1 driving the inputs of gates 2 and 3.]

Series Connection (CS152 / Kubiatowicz, Spring 2004, Lec3.37)

[Figure: two inverters in series. Gate G1 drives node V1 (capacitance C1) and gate G2 drives Vout (capacitance Cout). The plot shows Vin, V1, and Vout versus time between GND and Vdd, with delays d1 and d2 marked at the Vdd/2 level.]

° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
  • Capacitance of the wire connecting the two gates
  • Input capacitance of the second inverter

Calculating Aggregate Delays (CS152 / Kubiatowicz, Spring 2004, Lec3.38)

[Figure: gate G1 drives node V1 (capacitance C1), which fans out to gate G2 (output V2) and gate G3 (output V3).]

° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
  • Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
  • Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
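The two rules on this slide, sum along serial paths and take the longest of the parallel paths, translate directly into code. The sketch below uses made-up stage delays in nanoseconds; the names `path_delay` and `critical_path` and all numbers are hypothetical.

```python
def path_delay(stage_delays):
    """Delay of one serial path: the sum of its stage delays."""
    return sum(stage_delays)


def critical_path(paths):
    """Return (name, delay) of the slowest path in a dict of serial paths."""
    name = max(paths, key=lambda n: path_delay(paths[n]))
    return name, path_delay(paths[name])


# Hypothetical delays: both paths share the Vin -> V1 stage (0.5 ns),
# then differ in the second stage.
paths = {
    "Vin->V2": [0.5, 0.25],   # Delay(Vin->V1) + Delay(V1->V2)
    "Vin->V3": [0.5, 0.75],   # Delay(Vin->V1) + Delay(V1->V3)
}
worst = critical_path(paths)
```

With these numbers the Vin -> V3 path is critical at 1.25 ns, even though both paths share their first stage.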

Characterize a Gate (CS152 / Kubiatowicz, Spring 2004, Lec3.39)

° Input capacitance for each input
° For each input-to-output path:
  • For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
    - Internal delay (ns)
    - Load dependent delay (ns / fF)
° Example: 2-input NAND Gate
  • For A and B: Input Load (I.L.) = 61 fF
  • For either A -> Out or B -> Out:
      Tlh = 0.5ns    Tlhf = 0.0021ns / fF
      Thl = 0.1ns    Thlf = 0.0020ns / fF

[Figure: NAND gate (A, B, Out) driving Cout, and a plot of Delay A -> Out (Out: Low -> High) versus Cout: a straight line with intercept 0.5ns and slope 0.0021ns / fF.]
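The NAND characterization above is a linear model: delay = internal delay + (load-dependent delay) × load. The sketch below plugs in the slide's numbers; the dictionary layout, the function name `gate_delay`, and the choice of load (two NAND inputs, wire capacitance ignored) are assumptions made for illustration.

```python
# Parameters from the 2-input NAND example (ns, ns/fF, fF).
NAND2 = {
    "tlh": 0.5, "tlhf": 0.0021,   # output low -> high
    "thl": 0.1, "thlf": 0.0020,   # output high -> low
    "input_load_fF": 61,
}


def gate_delay(gate, load_fF, transition):
    """Linear delay model: internal delay + slope * capacitive load."""
    if transition == "lh":
        return gate["tlh"] + gate["tlhf"] * load_fF
    if transition == "hl":
        return gate["thl"] + gate["thlf"] * load_fF
    raise ValueError("transition must be 'lh' or 'hl'")


# Assumed load: the output drives two NAND inputs and no wire capacitance.
load = 2 * NAND2["input_load_fF"]
d_lh = gate_delay(NAND2, load, "lh")
d_hl = gate_delay(NAND2, load, "hl")
```

With a 122 fF load this gives roughly 0.756 ns for the rising output and 0.344 ns for the falling one, so the two edges differ by more than a factor of two for this gate.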

A Specific Example: 2 to 1 MUX (CS152 / Kubiatowicz, Spring 2004, Lec3.40)

Y = (A and !S) or (B and S)

[Figure: a 2 x 1 Mux with data inputs A and B, select S, and output Y, built from Gate 1, Gate 2, and Gate 3 connected by Wire 0, Wire 1, and Wire 2.]

° Input Load (I.L.)
  • A, B: I.L. (NAND) = 61 fF
  • S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load Dependent Delay (L.D.D.): Same as Gate 3
  • TAYlhf = 0.0021 ns / fF    TAYhlf = 0.0020 ns / fF
  • TBYlhf = 0.0021 ns / fF    TBYhlf = 0.0020 ns / fF
  • TSYlhf = 0.0021 ns / fF    TSYhlf = 0.0020 ns / fF

Linear model works for reasonable fan-out.

Driving more gates adds delay.
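The input-load figures for the mux follow from adding up the input loads of every gate input that a mux pin drives. The sketch below encodes that bookkeeping; the table `MUX_PIN_DRIVES` is an assumed reading of the slide's arithmetic (A and B each drive one NAND input, S drives an inverter and one NAND input), and the names are hypothetical.

```python
# Input load of each cell type, in fF, from the slide.
INPUT_LOAD_FF = {"NAND": 61, "INV": 50}

# Which gate inputs each mux pin drives (assumed from the slide's sums).
MUX_PIN_DRIVES = {
    "A": ["NAND"],
    "B": ["NAND"],
    "S": ["INV", "NAND"],
}


def input_load(pin):
    """Total capacitive load a mux pin presents to whatever drives it."""
    return sum(INPUT_LOAD_FF[cell] for cell in MUX_PIN_DRIVES[pin])


loads = {pin: input_load(pin) for pin in MUX_PIN_DRIVES}
```

This is the sense in which the linear model composes: the mux can itself be treated as a cell with an input load per pin, and its select pin is the heaviest load at 111 fF.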

Propagation delay graphs ...

[Inset slide: Spring 2003 EECS150 – Lec10-Timing, Page 11]

Gate Delay
• Cascaded gates:

[Figure: cascaded gates with Vin and Vout waveforms; the annotated transition is 1->0.]

Intuition: Critical paths ...

[Inset slide: Spring 2003 EECS150 – Lec10-Timing, Page 13]

Gate Delay
• “Fan-in”
• What is the delay in this circuit?
• Critical path: the path with the maximum delay, from any input to any output.
  – In general, we include register set-up and clk-to-Q times in critical path calculation.
• Why do we care about the critical path?

x = g(a, b, c, d, e, f)

If d going 0-to-1 switches x 0-to-1, delay is T1.

If a going 0-to-1 switches x 0-to-1, delay is T2.

It would be surprising if T1 > T2.

T2 might be the critical (worst-case delay) path.
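Including register set-up and clk-to-Q times in the critical path is what turns a combinational delay into a clock period. The sketch below uses the usual register-to-register constraint, period >= clk-to-Q + worst combinational delay + set-up; it ignores clock skew, and every number and name in it is hypothetical.

```python
def min_clock_period(t_clk_to_q, comb_path_delays, t_setup):
    """Smallest period (ns) that lets the slowest path meet set-up.

    Assumes a single clock with no skew; comb_path_delays lists the
    delays of all register-to-register combinational paths.
    """
    return t_clk_to_q + max(comb_path_delays) + t_setup


# Hypothetical timing numbers in nanoseconds.
period_ns = min_clock_period(
    t_clk_to_q=0.5,
    comb_path_delays=[2.0, 3.25, 1.5],
    t_setup=0.25,
)
max_freq_mhz = 1000.0 / period_ns
```

Only the slowest path (3.25 ns here) matters for the period, which is why speeding up any other path does not raise the clock rate.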

Page 20: Computer Architecture and Engineering Lecture 5 Timingcs152/fa06/lecnotes/lec3-1.pdf · Lec3.7 ¡W ire s: C a rry signa ls from one point to a nothe r ¥ S ingle bit ( no siz e la

UC Regents Fall 2006 © UCBCS 152 L5: Timing

Why “might”? Wires have delay too ...

!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8

!"#$%&$'()

Wire Delay

• Even in those cases where the transmission line effect is negligible:
  - Wires possess distributed resistance and capacitance
  - Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
  - Typically around half of C of gate load is in the wires.
• For long wires on ICs:
  - busses, clock lines, global control signal, etc.
  - Resistance is significant, therefore distributed RC effect dominates.
  - signals are typically "rebuffered" to reduce delay:

[Figure: waveforms of v1, v2, v3, v4 versus time.]

Spring 2003 EECS150 - Lec10-Timing
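The length-squared scaling and the benefit of rebuffering can be sketched numerically. This is a minimal sketch, assuming the Elmore approximation for a uniform distributed RC wire (delay ≈ 0.5 · R_total · C_total) and an idealized buffer with a fixed delay; the function names and the per-millimeter values are hypothetical, not taken from the slides.

```python
def distributed_rc_delay(r_per_mm, c_per_mm, length_mm):
    """Elmore delay of a uniform distributed RC wire: 0.5 * (r*L) * (c*L)."""
    return 0.5 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def rebuffered_delay(r_per_mm, c_per_mm, length_mm, n_segments, t_buffer):
    """Delay when the wire is split into n equal segments with a buffer
    between adjacent segments (buffer loading is ignored: an assumption)."""
    segment = distributed_rc_delay(r_per_mm, c_per_mm, length_mm / n_segments)
    return n_segments * segment + (n_segments - 1) * t_buffer


# Hypothetical wire: 100 ohm/mm, 0.2 pF/mm, 10 mm long; 50 ps buffers.
R_PER_MM = 100.0
C_PER_MM = 0.2e-12

unbuffered = distributed_rc_delay(R_PER_MM, C_PER_MM, 10.0)
doubled = distributed_rc_delay(R_PER_MM, C_PER_MM, 20.0)
buffered = rebuffered_delay(R_PER_MM, C_PER_MM, 10.0, 4, 50e-12)

print(f"10 mm unbuffered: {unbuffered * 1e12:.0f} ps")
print(f"20 mm unbuffered: {doubled * 1e12:.0f} ps")
print(f"10 mm, 4 segments: {buffered * 1e12:.0f} ps")
```

Under these assumptions, doubling the length quadruples the unbuffered delay, while splitting the wire trades the quadratic term for a roughly linear one plus the buffer delays.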

Looks benign, but ...


Clocked Logic Circuits


From Delay Models to Timing Analysis

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Example

• Parallel to serial converter:
  - T ≥ time(clk→Q) + time(mux) + time(setup)
    T ≥ τ_clk→Q + τ_mux + τ_setup

[Figure: parallel-to-serial converter circuit, clocked by clk.]

f          T
1 MHz      1 μs
10 MHz     100 ns
100 MHz    10 ns
1 GHz      1 ns
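The table pairs each clock frequency f with its period T = 1/f, and the converter example bounds T from below by the sum of three delays. A minimal sketch of that arithmetic follows; the function names and the picosecond values are hypothetical and chosen only so the result lands on a row of the table.

```python
def min_clock_period_ps(t_clk_to_q_ps, t_logic_ps, t_setup_ps):
    """Smallest period that satisfies T >= t_clk->Q + t_logic + t_setup."""
    return t_clk_to_q_ps + t_logic_ps + t_setup_ps


def max_frequency_hz(period_ps):
    """Convert a clock period in picoseconds to a frequency in hertz."""
    return 1e12 / period_ps


# Hypothetical delays for the parallel-to-serial converter path (not from the slides).
period = min_clock_period_ps(t_clk_to_q_ps=200, t_logic_ps=500, t_setup_ps=300)
print(period, "ps")
print(max_frequency_hz(period) / 1e9, "GHz")
```

With these assumed delays the path needs a 1000 ps (1 ns) period, which corresponds to the 1 GHz row.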

Timing Analysis: What is the smallest T that produces correct operation?


Timing Analysis and Logic Delay

If T > worst-case delay through CL, does this ensure correct operation?


1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz

Lec3.9

General C/L Cell Delay Model

° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth-table, logic equation, VHDL
  • Input load factor of each input
  • Propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes

[Figure: Combinational Logic Cell with inputs A, B, ... and output X driving load Cout; plot of Delay (Va -> Vout) versus Cout, showing Internal Delay, delay per unit load, and Ccritical.]
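The linear model above can be sketched in a few lines. This is an illustration under stated assumptions: the `Cell` fields, the helper names, and the unit-less delay and load values are hypothetical, and "composes" is read here as summing per-cell delays along a chain in which each cell is loaded by the input load factor of the next.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    internal_delay: float   # fixed internal delay
    delay_per_load: float   # load-dependent delay per unit load
    input_load: float       # load factor this cell presents on each input


def cell_delay(cell, load):
    """Linear model: fixed internal delay + load-dependent delay x load."""
    return cell.internal_delay + cell.delay_per_load * load


def chain_delay(cells, final_load):
    """Delay through cells in series; each cell drives the next cell's input."""
    total = 0.0
    for i, cell in enumerate(cells):
        load = cells[i + 1].input_load if i + 1 < len(cells) else final_load
        total += cell_delay(cell, load)
    return total


# Hypothetical cells (arbitrary units).
inverter = Cell(internal_delay=1.0, delay_per_load=0.5, input_load=1.0)
nand2 = Cell(internal_delay=1.5, delay_per_load=0.5, input_load=2.0)

print(chain_delay([inverter, nand2], final_load=4.0))
```

Here the inverter sees a load of 2.0 (delay 2.0) and the NAND sees the final load of 4.0 (delay 3.5), for a total of 5.5 units.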

Lec3.10

Storage Element’s Timing Model

° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
  • Output cannot change instantaneously at the trigger clock edge
  • Similar to delay in logic gates, two components:
    - Internal Clock-to-Q
    - Load dependent Clock-to-Q

[Figure: storage element with inputs D and Clk and output Q; waveform showing D as Don’t Care outside the Setup and Hold window, and Q Unknown during the Clock-to-Q interval.]

Lec3.11

Clocking Methodology

[Figure: storage elements clocked by Clk on either side of a Combination Logic block.]

° All storage elements are clocked by the same clock edge
° The combination logic blocks:
  • Inputs are updated at each clock tick
  • All outputs MUST be stable before the next clock tick

Lec3.12

Critical Path & Cycle Time

[Figure: storage elements clocked by Clk with combination logic between them.]

° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° must be greater than:
  Clock-to-Q + Longest Path through Combination Logic + Setup
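One way to find the "Longest Path through Combination Logic" is to propagate arrival times through the gates in topological order. The sketch below assumes the logic between two registers is an acyclic graph with one fixed delay per gate; the gate names, delays, and helper names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def longest_path_delay(gate_delay, fanin):
    """Latest arrival time at any gate output.

    gate_delay: gate name -> delay of that gate
    fanin:      gate name -> list of gates feeding it (absent = driven by a register)
    """
    graph = {gate: fanin.get(gate, []) for gate in gate_delay}
    arrival = {}
    for gate in TopologicalSorter(graph).static_order():
        latest_input = max((arrival[p] for p in fanin.get(gate, [])), default=0)
        arrival[gate] = latest_input + gate_delay[gate]
    return max(arrival.values())


def min_cycle_time(t_clk_to_q, gate_delay, fanin, t_setup):
    """Clock-to-Q + longest path through the logic + setup."""
    return t_clk_to_q + longest_path_delay(gate_delay, fanin) + t_setup


# Hypothetical block of four gates (arbitrary time units).
delays = {"g1": 2, "g2": 3, "g3": 1, "g4": 2}
feeds = {"g3": ["g1", "g2"], "g4": ["g3", "g1"]}

print(longest_path_delay(delays, feeds))
print(min_cycle_time(1, delays, feeds, 1))
```

In this example the slowest path is g2, g3, g4 (3 + 1 + 2 = 6 units), so the cycle time must exceed 1 + 6 + 1 = 8 units.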

Register:

An Array of Flip-Flops


Combinational Logic


Flip Flops have internal delays ...

D Q

CLK

Value of D is sampled on positive clock edge.

Q outputs sampled value for rest of cycle.

D

Q

t_setup

t_clk-to-Q

Where do Flip Flop delays come from? Wait for VLSI lectures.


Flip-Flop delays eat into “time budget”

ALU “time budget”

General Model of Synchronous Circuit

• In general, for correct operation:
    T ≥ time(clk→Q) + time(CL) + time(setup)
    T ≥ τ_clk→Q + τ_CL + τ_setup
  for all paths.
• How do we enumerate all paths?
  - Any circuit input or register output to any register input or circuit output.
  - "setup time" for circuit outputs depends on what it connects to
  - "clk-Q time" for circuit inputs depends on from where it comes.

[Figure: clock, input, and output connected through reg and CL blocks, with option feedback.]


Combinational Logic


Clock skew also eats into “time budget”

Clock Skew (cont.)

• If clock period T = T_CL + T_setup + T_clk→Q, circuit will fail.
• Therefore:
  1. Control clock skew
     a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wires delay and buffer delay.
     b) don't "gate" clocks.
  2. T ≥ T_CL + T_setup + T_clk→Q + worst case skew.
• Most modern large high-performance chips (microprocessors) control end to end clock skew to a few tenths of a nanosecond.

[Figure: registers and CL blocks clocked by CLK and CLK'; clock skew, delay in distribution.]

Clock Skew (cont.)

• Note reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!

[Figure: CL block with registers clocked by CLK and CLK'; clock skew, delay in distribution.]
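Both skew cases can be captured by one signed term. The sketch below is an assumption-laden illustration, not the slides' own formulation: it models skew as the clock arrival delay at the launching register minus the delay at the capturing register, so a late launching clock shrinks the time budget and a late capturing clock (the "reversed buffer" case) extends it. The function name, parameter names, and picosecond values are hypothetical.

```python
def min_period_ps(t_clk_to_q_ps, t_cl_ps, t_setup_ps,
                  launch_clk_delay_ps=0, capture_clk_delay_ps=0):
    """Smallest period meeting setup, with signed skew = launch - capture delay."""
    skew_ps = launch_clk_delay_ps - capture_clk_delay_ps
    return t_clk_to_q_ps + t_cl_ps + t_setup_ps + skew_ps


# Hypothetical path: 200 ps clk-to-Q, 600 ps logic, 200 ps setup.
print(min_period_ps(200, 600, 200))                            # no skew
print(min_period_ps(200, 600, 200, launch_clk_delay_ps=100))   # skew costs time
print(min_period_ps(200, 600, 200, capture_clk_delay_ps=100))  # skew adds time
```

When the direction of the skew is not known in advance, budgeting for its worst-case magnitude, as item 2 of the previous slide does, is the conservative choice.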

As T → 0, which circuit fails first?



Some Flip Flops have “hold” time ...

D

t_setup

CLK

t_hold

D must stay

stable here

D Q

CLK

Does flip-flop hold time affect operation of this circuit? Under what conditions?

t_inv

What is the intended function of this circuit?

t_clk-to-Q + t_inv > t_hold for correct operation.
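The hold condition can be checked mechanically. This is a small sketch with hypothetical names and picosecond values; it assumes the delays supplied are the fastest (minimum) values, since the earliest possible change at D is what threatens the hold window.

```python
def hold_margin_ps(t_clk_to_q_ps, t_inv_ps, t_hold_ps):
    """Time by which the new D value arrives after the hold window closes."""
    return t_clk_to_q_ps + t_inv_ps - t_hold_ps


def hold_ok(t_clk_to_q_ps, t_inv_ps, t_hold_ps):
    """True when t_clk-to-Q + t_inv > t_hold."""
    return hold_margin_ps(t_clk_to_q_ps, t_inv_ps, t_hold_ps) > 0


# Hypothetical minimum delays (not from the slides).
print(hold_ok(t_clk_to_q_ps=80, t_inv_ps=30, t_hold_ps=100))  # True: margin is 10 ps
print(hold_ok(t_clk_to_q_ps=50, t_inv_ps=30, t_hold_ps=100))  # False: D changes too early
```

Note that the clock period T does not appear in this check, so a hold violation cannot be fixed by slowing the clock.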


Searching for processor critical path

Timing Analysis: What is the smallest T that produces correct operation? Must consider all connected register pairs.

?

Why might I suspect this one?


[Figure: single-cycle datapath with PC, Instr Mem (Addr, Data), RegFile (rs1, rs2, ws, wd, rd1, rd2, WE), Ext, ALU, Data Memory (Addr, Din, Dout, WE), and two adders (one with constant 0x4), clocked by Clk. Control lines from Combinational Logic: RegDest, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, RegWr, PCSrc; inputs op and Equal.]

CS 152 L06 Single Cycle 1 (6) UC Regents Fall 2004 © UCB

Step 1a: The MIPS-lite Subset for today

° ADD and SUB
  • addU rd, rs, rt
  • subU rd, rs, rt

  bits:   31-26   25-21   20-16   15-11   10-6    5-0
  field:  op      rs      rt      rd      shamt   funct
  width:  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

° OR Immediate:
  • ori rt, rs, imm16

° LOAD and STORE Word
  • lw rt, rs, imm16
  • sw rt, rs, imm16

° BRANCH:
  • beq rs, rt, imm16

  Format shared by ori, lw, sw, and beq:

  bits:   31-26   25-21   20-16   15-0
  field:  op      rs      rt      immediate
  width:  6 bits  5 bits  5 bits  16 bits

Searching for processor critical path


Real Stuff: Timing Estimation, Closure

Timing Estimation: Predicting a processor’s clock rate early in the project

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.


Real Stuff: Timing Estimation, Closure

Timing Closure: Meeting (or exceeding!) the timing estimate

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.


Real Stuff: Timing Estimation, Closure

From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.

The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.

Summary

The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.

Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.

[Flow diagram. Stages: core or chip wiring; extraction; Chipbench / EinsTimer (twice); capacitance adjust; uplift analysis; noise impact on timing; non-uplift timing; analysis/update (wires, buffers). Artifacts: VIM, timer files, reports, asserts, Spice, GL/1. Turnaround labels: < 12 hr (three times), < 24 hr, < 48 hr. Extraction inputs: extracted units (flat or hierarchical), incrementally extracted RLMs, custom NDRs, VIMs.]

Notes:
• Executed 2–3 months prior to tape-out
• Fully extracted data from routed designs
• Hierarchical extraction
• Custom logic handled separately
  • Dracula
  • Harmony
• Extraction done for
  • Early
  • Late

Figure 26: Histogram of the POWER4 processor path delays.

[Histogram. x axis: timing slack (ps), −40 to 280. y axis: late-mode timing checks (thousands), 0 to 200.]


Most wires have hundreds of picoseconds to spare.

The critical path
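A slack histogram like Figure 26 can be sketched as follows. This is a minimal sketch, assuming slack is simply the clock period minus the path delay; the 770 ps period (roughly 1.3 GHz) and the path delays are made-up inputs, and `slack_histogram` is a hypothetical helper, not part of any timing tool.

```python
from collections import Counter


def slack_ps(clock_period_ps, path_delay_ps):
    """Positive slack means the path has time to spare; negative means it fails."""
    return clock_period_ps - path_delay_ps


def slack_histogram(clock_period_ps, path_delays_ps, bin_ps=20):
    """Count paths per slack bin; each key is the lower edge of a bin."""
    bins = Counter()
    for delay in path_delays_ps:
        slack = slack_ps(clock_period_ps, delay)
        bins[(slack // bin_ps) * bin_ps] += 1
    return dict(sorted(bins.items()))


# Made-up path delays in ps against an assumed 770 ps clock period.
delays = [780, 760, 600, 590, 500]
print(slack_histogram(770, delays))
print(min(slack_ps(770, d) for d in delays))  # slack of the critical path
```

The critical path is the entry with the least slack, at the far left of the histogram.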


1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Real Stuff: Floorplanning Intel XScale 80200


Administrivia: Upcoming deadlines ...

Friday 9/15: “ModelSim Checkoff”, in section, 125 Cory.

Monday 9/25: Lab 2 final report due via the submit program, 11:59 PM.

Friday 9/22: “Xilinx Checkoff”, in section, 125 Cory.


Office Hours, Mid-terms ...

Mid-term 1: Tuesday October 3rd, 6:00 to 9:00 PM, TBA.

Mid-term 2: Tuesday December 5th, 6:00 to 9:00 PM, TBA.

Card Key Woes? Go to the office you handed your form into and ask why. Let me know what they say ...

Udam: MW 6-7 PM, 125 Cory

Jue: TTh 3-4 PM, 125 Cory

John: TTh 10-11 AM, 315 Soda


Timing in Xilinx Designs

Spartan-3 FPGA Family: Introduction and Ordering Information

www.xilinx.com, DS099-1 (v1.3), July 13, 2004, Preliminary Product Specification

Package Marking

Table 3: Spartan-3 I/O Chart

Available User I/Os and Differential (Diff) I/O Pairs, by package. Each cell is User / Diff.

Device    VQ100/VQG100  TQ144/TQG144  PQ208/PQG208  FT256/FTG256  FG320/FGG320  FG456/FGG456  FG676/FGG676  FG900/FGG900  FG1156/FGG1156
XC3S50    63 / 29       97 / 46       124 / 56      -             -             -             -             -             -
XC3S200   63 / 29       97 / 46       141 / 62      173 / 76      -             -             -             -             -
XC3S400   -             97 / 46       141 / 62      173 / 76      221 / 100     264 / 116     -             -             -
XC3S1000  -             -             -             173 / 76      221 / 100     333 / 149     391 / 175     -             -
XC3S1500  -             -             -             -             221 / 100     333 / 149     487 / 221     -             -
XC3S2000  -             -             -             -             -             -             489 / 221     565 / 270     -
XC3S4000  -             -             -             -             -             -             -             633 / 300     712 / 312
XC3S5000  -             -             -             -             -             -             -             633 / 300     784 / 344

Notes: 1. All device options listed in a given package column are pin-compatible.

[Figure: Spartan-3 package marking example (XC3S50, PQ208, 4C), with callouts for device type, package, speed grade, temperature range, date code, and lot code. Figure ID ds099-1_03_071304.]


Prior Art for FPGAs ...


Xilinx: Large Logic Array + Block RAM


Architectural Overview

The Spartan-3 family architecture consists of five fundamental programmable functional elements:

• Configurable Logic Blocks (CLBs) contain RAM-based Look-Up Tables (LUTs) to implement logic and storage elements that can be used as flip-flops or latches. CLBs can be programmed to perform a wide variety of logical functions as well as to store data.

• Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal logic of the device. Each IOB supports bidirectional data flow plus 3-state operation. Twenty-four different signal standards, including seven high-performance differential standards, are available as shown in Table 2. Double Data-Rate (DDR) registers are included. The Digitally Controlled Impedance (DCI) feature provides automatic on-chip terminations, simplifying board designs.

• Block RAM provides data storage in the form of 18-Kbit dual-port blocks.

• Multiplier blocks accept two 18-bit binary numbers as inputs and calculate the product.

• Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for distributing, delaying, multiplying, dividing, and phase shifting clock signals.

These elements are organized as shown in Figure 1. A ring of IOBs surrounds a regular array of CLBs. The XC3S50 has a single column of block RAM embedded in the array. Those devices ranging from the XC3S200 to the XC3S2000 have two columns of block RAM. The XC3S4000 and XC3S5000 devices have four RAM columns. Each column is made up of several 18K-bit RAM blocks; each block is associated with a dedicated multiplier. The DCMs are positioned at the ends of the outer block RAM columns.

The Spartan-3 family features a rich network of traces and switches that interconnect all five functional elements, transmitting signals among them. Each functional element has an associated switch matrix that permits multiple connections to the routing.

Figure 1: Spartan-3 Family Architecture (DS099-1_01_032703)

Notes: 1. The two additional block RAM columns of the XC3S4000 and XC3S5000 devices are shown with dashed lines. The XC3S50 has only the block RAM column on the far left.

From: Xilinx Spartan 3 data sheet, modified to approximate Virtex architecture.

CLB == Configurable Logic Block

“Swiss Army Knife” part

I/O Block (off-chip)


Blades in the CLB “Swiss Army Knife”

Virtex™-E 1.8 V Field Programmable Gate Arrays

Module 2 of 4, www.xilinx.com, DS022-2 (v2.4), July 17, 2002, Production Product Specification

Storage Elements

The storage elements in the Virtex-E slice can be configured either as edge-triggered D-type flip-flops or as level-sensitive latches. The D inputs can be driven either by the function generators within the slice or directly from slice inputs, bypassing the function generators.

In addition to Clock and Clock Enable signals, each Slice has synchronous set and reset signals (SR and BY). SR

Figure 4: 2-Slice Virtex-E CLB (ds022_04_121799)

[Diagram: Slice 0 and Slice 1. Each slice has two LUTs (inputs F1 to F4 and G1 to G4), two Carry & Control blocks, and two storage elements (D, CE, Q, SP, RC). Slice pins: BX, BY, CIN, COUT, X, XB, XQ, Y, YB, YQ.]

Figure 5: Detailed View of Virtex-E Slice (ds022_05_092000)

[Diagram: one slice in detail. Inputs: G4 to G1, F4 to F1, BX, BY, F5IN, SR, CLK, CE, CIN. Outputs: X, XB, XQ, Y, YB, YQ, COUT. Internal blocks: two LUTs (I0 to I3, O, WE, DI), carry logic (CY), F5 and F6 multiplexers, write-control signals (WSO, WSH, A4, DG, DI), and two storage elements (D, CE, Q, INIT, REV).]

Edge-triggered flip-flop


Adder carry chain, multiplier step, LUT expansion logic.

LUT box can also turn into RAM or a shift register chain

example g(F1, F2, F3, F4): F1 ^ F2 ^ F3 ^ F4

Look Up Table (LUT)

g()
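The example g() above can be modeled as a 16-entry table. This is a minimal sketch, assuming that input k is bit k of the table address (an ordering chosen for the sketch, not taken from the Xilinx data sheet); `lut_init` and `lut_read` are hypothetical helper names.

```python
def lut_init(func, n_inputs=4):
    """Build the 2**n truth-table bits that would be loaded into an n-input LUT."""
    bits = []
    for addr in range(2 ** n_inputs):
        # Input k is bit k of the address (an arbitrary ordering for this sketch).
        args = [(addr >> k) & 1 for k in range(n_inputs)]
        bits.append(func(*args))
    return bits


def lut_read(bits, *inputs):
    """Evaluate the LUT: the inputs form an address into the stored bits."""
    addr = sum(bit << k for k, bit in enumerate(inputs))
    return bits[addr]


def g(f1, f2, f3, f4):
    """The example function from the slide: a 4-input XOR."""
    return f1 ^ f2 ^ f3 ^ f4


table = lut_init(g)
print(table)
print(lut_read(table, 1, 0, 1, 1))
```

Changing g changes only the stored bits; the lookup hardware stays the same, which is what makes the LUT a general-purpose gate.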


Inside a LUT ...


LUT as general logic gate

• An n-lut as a direct implementation of a function truth-table.
• Each latch location holds the value of the function corresponding to one input combination.

Example: 4-lut

INPUTS
0000    F(0,0,0,0)    store in 1st latch
0001    F(0,0,0,1)    store in 2nd latch
0010    F(0,0,1,0)
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111

Example: 2-lut

INPUTS   AND   OR
11       1     1
10       0     1
01       0     1
00       0     0

Implements any function of 2 inputs.

How many of these are there?

How many functions of n inputs?
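The count of distinct functions an n-input LUT can hold can be checked by brute force: each of the 2^n table entries is an independent bit, so there are 2^(2^n) possible tables. The sketch below enumerates them for small n; `all_functions` and `truth_table` are hypothetical helper names.

```python
from itertools import product


def all_functions(n):
    """Every truth table (a tuple of 2**n output bits) for n inputs."""
    return list(product((0, 1), repeat=2 ** n))


def truth_table(func, n):
    """Output bits of func for inputs listed in counting order 00..0 to 11..1."""
    return tuple(func(*bits) for bits in product((0, 1), repeat=n))


tables = all_functions(2)
print(len(tables))                                   # distinct 2-input functions
print(truth_table(lambda a, b: a & b, 2) in tables)  # AND is one of them
print(len(all_functions(4)) == 2 ** (2 ** 4))        # a 4-LUT holds any of 65536
```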

4-LUT Implementation

• n-bit LUT is implemented as a 2^n x 1 memory:
  – inputs choose one of 2^n memory locations.
  – memory locations (latches) are normally loaded with values from user's configuration bit stream.
  – Inputs to mux control are the CLB inputs.
• Result is a general purpose "logic gate".
  – n-LUT can implement any function of n inputs!

[Diagram: 16 latches feed a 16 x 1 mux; INPUTS drive the mux select lines and the mux drives OUTPUT. Latches programmed as part of configuration bit-stream. Lecture annotation: each latch is marked "FF".]


Part of a FF “scan chain”

To next FF in chain ...

...

Set during configuration.
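Loading configuration bits through a chain of flip-flops can be sketched as a shift register. This is a minimal sketch, assuming one bit enters at the head of the chain per configuration clock; `shift_in` and `configure` are hypothetical names, and real devices add framing and control that this ignores.

```python
def shift_in(chain, bit):
    """One configuration clock: each FF takes its neighbor's value, bit enters at the head."""
    return [bit] + chain[:-1]


def configure(bitstream, length):
    """Shift a bitstream through `length` flip-flops; the first bit in ends up deepest."""
    chain = [0] * length
    for bit in bitstream:
        chain = shift_in(chain, bit)
    return chain


print(configure([1, 0, 1, 1], 4))
```

After as many clocks as there are flip-flops, every position in the chain holds one bit of the stream, in reverse order of arrival.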


Key things to remember ...

How to learn what to do: read the Synplicity and Xilinx documentation, try small examples, look at CAD tool log files and output, ask the TAs.

The way you structure your design (and your Verilog) can make logic mapping “better” (denser, faster).

CAD tools choose mapping from Verilog to CLB resources.


After routing ...


Xilinx: Large Array of CLBs, plus RAM


plus wires

From: Xilinx Spartan 3 data sheet, simplified.


Why Xilinx wires are so slow ...

Wires are slow because (1) each green dot is a transistor switch (2) path may not be shortest length (3) all wires are too long!

The best Xilinx users “write Verilog to the grid”. When Xilinx designs FPGA chips, wiring channels are optimized for (2) & (3).
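The first two reasons can be put into a toy delay model. This is a rough sketch, assuming a route is a list of grid points, each hop between points is one wire segment, and each intermediate point is one transistor switch; the 100 ps and 150 ps figures are invented for illustration and are not Xilinx numbers.

```python
def manhattan(a, b):
    """Shortest possible hop count between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def route_delay_ps(route, switch_ps, segment_ps):
    """Delay of a route: one segment per hop, one switch per intermediate point."""
    hops = len(route) - 1
    switches = max(hops - 1, 0)
    return hops * segment_ps + switches * switch_ps


direct = [(0, 0), (1, 0), (2, 0), (3, 0)]
detour = [(0, 0), (0, 1), (1, 1), (2, 1), (3, 1), (3, 0)]

for route in (direct, detour):
    extra_hops = (len(route) - 1) - manhattan(route[0], route[-1])
    print(extra_hops, route_delay_ps(route, switch_ps=150, segment_ps=100))
```

In this model a detour costs twice: the extra segments add wire delay, and each extra point adds another switch.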

Connect this

To this


What are the green dots?

User Programmability

• Latches are used to:
  1. make or break cross-point connections in the interconnect
  2. define the function of the logic blocks
  3. set user options:
     – within the logic blocks
     – in the input/output blocks
     – global reset/clock
• "Configuration bit stream" can be loaded under user control:
  – All latches are strung together in a shift chain:
• Latch-based (Xilinx, Altera, …)
  – reconfigurable
  – volatile
  – relatively large.

[Diagram annotations: the "latch" is marked FF; A “cross-point connection”]

FPGA Variations

• Families of FPGA's differ in:
  – physical means of implementing user programmability,
  – arrangement of interconnection wires, and
  – the basic functionality of the logic blocks.
• Most significant difference is in the method for providing flexible blocks and connections:
• Anti-fuse based (ex: Actel)
  + Non-volatile, relatively small
  – fixed (non-reprogrammable)

Set during configuration.

One flip-flop and a pass gate for each switch point. In order to have enough wires in the channels to wire up CLBs for most circuits, we need a lot of switch points! Thus, “80%+ of FPGA is for wiring”.


Clocks have dedicated wires (low skew)

Spartan-3 FPGA Family: Functional Description

www.xilinx.com, DS099-2 (v1.3), August 24, 2004, Preliminary Product Specification

width of the die. In turn, the horizontal spine branches out into a subsidiary clock interconnect that accesses the CLBs.

2. The clock input of either DCM on the same side of the die — top or bottom — as the BUFGMUX element in use.

A Global clock input is placed in a design using either a BUFGMUX element or the BUFG (Global Clock Buffer) element. For the purpose of minimizing the dynamic power dissipation of the clock network, the Xilinx development software automatically disables all clock line segments that a design does not use.

Figure 18: Spartan-3 Clock Network (Top View) (DS099-2_18_070203)

[Diagram: two groups of 4 BUFGMUX elements (GCLK0 to GCLK3 and GCLK4 to GCLK7), four DCMs, a top spine and a bottom spine feeding a horizontal spine, and array-dependent branches with 4 or 8 clock lines each.]

From: Xilinx Spartan 3 data sheet. Virtex is similar.
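Why low skew matters for timing can be sketched with the usual setup-time budget. This is a minimal sketch, assuming the worst case where skew makes the receiving register's clock edge arrive early, so the skew adds directly to the required period; all delay values are invented for illustration.

```python
def min_period_with_skew_ps(clk_to_q_ps, logic_ps, setup_ps, skew_ps):
    """Smallest period when the capturing clock edge can arrive skew_ps early."""
    return clk_to_q_ps + logic_ps + setup_ps + skew_ps


def max_frequency_mhz(period_ps):
    """Convert a clock period in picoseconds to a frequency in MHz."""
    return 1e6 / period_ps


for skew in (0, 210):
    period = min_period_with_skew_ps(50, 1200, 40, skew)
    print(skew, period, round(max_frequency_mhz(period), 1))
```

Dedicated clock wires keep the skew term small, so nearly all of the period is left for logic.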


Die photo: Xilinx Virtex

Gold wires are the clock tree.


the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15 200 clock pins on the chip, regardless of proximity.

From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.

Design methodology and resultsThis clock-distribution design method allows a highlyproductive combination of top-down and bottom-up designperspectives, proceeding in parallel and meeting at thesingle clock grid, which is designed very early. The treesdriving the grid are designed top-down, with the maximumwire widths contracted for them. Once the contract for thegrid had been determined, designers were insulated fromchanges to the grid, allowing necessary adjustments to thegrid to be made for minimizing clock skew even at a verylate stage in the design process. The macro, unit, and chipclock wiring proceeded bottom-up, with point tools ateach hierarchical level (e.g., macro, unit, core, and chip)using contracted wiring to form each segment of the totalclock wiring. At the macro level, short clock routesconnected the macro clock pins to the local clock buffers.These wires were kept very short, and duplication ofexisting higher-level clock routes was avoided by allowingthe use of multiple clock pins. At the unit level, clockrouting was handled by a special tool, which connected themacro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed

Figure 6: Schematic diagram of global clock generation and distribution. [Diagram labels: PLL, Bypass, Reference clock in, Reference clock out, Clock distribution, Clock out.]

Figure 7: 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines. [Plot labels: Delay (vertical axis), x, y, Grid, Tuned sector trees, Sector buffers, Buffer level 2, Buffer level 1.]

Figure 8: Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale. [Plot labels: Delay (vertical axis), x, y, Multiple-fingered transmission line.]

J. D. WARNOCK ET AL., IBM J. RES. & DEV., VOL. 46, NO. 1, JANUARY 2002, p. 32

Clock Tree Delays, IBM “Power” CPU



clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid. At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved to guarantee feasibility and speed the design process.

Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a “correct-by-design” clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.

Circuit design

The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic, an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.

The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology, including uncertainties associated with the modeling of the floating-body effect [21–23] and its impact on noise immunity [22, 24–27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.

Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.

The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description

Figure 9: Global clock waveforms showing 20 ps of measured skew. [Plot axes: Volts (V), 0.0 to 1.5, versus Time (ps), 0 to 2500; annotation: 20 ps skew.]



Key things to remember about FPGAs ...

The Calinx board's Xilinx chip is large but not extremely large: 38,400 LUT + FF + adder-carry-chain elements and 655 kb of block RAM.

Normal designs: critical path 80% wire delay, 20% LUT delays. The best designers can flip these percentages.

Xilinx wires are “fake”: the cross-points in the path make the wires slow.

Tools: Global timing constraints, region locking.
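
To make the 80% / 20% split concrete, here is a minimal sketch that totals the LUT and wire delays along one path and reports each share. The function name and all delay values are hypothetical, chosen only so the arithmetic matches the "normal design" ratio on this slide; they are not Xilinx data-sheet numbers.

```python
def path_breakdown(lut_delays_ns, wire_delays_ns):
    """Return (total delay, wire fraction, LUT fraction) for one path."""
    lut_total = sum(lut_delays_ns)
    wire_total = sum(wire_delays_ns)
    total = lut_total + wire_total
    return total, wire_total / total, lut_total / total


# Hypothetical path: four LUTs in series, each followed by a routed wire.
lut_delays = [0.5, 0.5, 0.5, 0.5]   # ns per LUT (assumed)
wire_delays = [2.0, 2.0, 2.0, 2.0]  # ns per routed segment (assumed)

total_ns, wire_frac, lut_frac = path_breakdown(lut_delays, wire_delays)
print(f"total {total_ns:.1f} ns, wire {wire_frac:.0%}, LUT {lut_frac:.0%}")
```

Under these assumed numbers, halving the LUT delays would shorten the path by only 1 ns, while halving the wire delays would shorten it by 4 ns, which is why placement and routing constraints matter so much.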


The analogy holds up ...

CLBs are “real” elements, with real physics. Not a simulation of physics.

Configurability has a price: lower performance, wasted resources.


Cisco Systems products often use FPGAs

How can FPGAs be improved to work better for a particular type of product?

[Chart: cost in dollars, axis from $15M to $45M, broken down by category: Masks/Wafers, Test & Engineering, Software, Design/Verification & Layout.]

The days are long gone when you could spend a few hundred thousand dollars and six months developing an ASIC, drop it into a piece of equipment, then sit back and watch it sell for years. Development costs have spiraled into the tens of millions. Development times are stretching into double-digit months. And the ASICs themselves have become so complicated, half of them have to go back to be re-spun, while another 30% have to take a second or even third trip back to the drawing board, adding another three months—minimum—to the schedule.

That would be bad enough, except that rapid time to market is more valuable than ever. And more expensive than ever should you fail to attain it. In a recent speech, no less an authority than John Chambers of Cisco observed that every four-week delay in product availability cost his company 14% market share.

Four weeks.

Considering the pace of these markets, can you really afford to build something really expensive, that you can’t change, and that isn’t going to be finished for two years?

Those are your unattractive choices, if you choose to go the ASIC route. And that’s the good news. The bad news is, it’s only going to get worse.

Increased complexity inevitably leads to increased costs. And in the evolving networking, telecom, wireless, and storage markets, complexity is always going to increase. The ASIC manufacturers’ answer to dealing with this increased complexity is to reduce their geometries. Theoretically, that’ll reduce costs.

But it doesn’t. In fact, it does just the opposite.

Risk is no longer an option. Or a necessity.

Example: cswitch, an FPGA startup.


For a custom chip to deliver high performance in all types of networks, it has to move, store, and edit packets at very high speeds. The Configurable Switch Array chip does just that. It’s the first configurable solution to deliver bandwidth at 40 to 100 Gbps for a range of applications, making it capable of moving up to 6 TBps of packets at speeds of up to 2 GHz. To handle editing tasks, the chip packs a rich assortment of Frame Header Parsers, Arithmetic Units, and RCAMs to edit and classify packets at 1 GHz speeds. For storing packets, the chip includes over 18 Mb of on-chip memory, as well as support for the latest high-speed memories, such as DDR2, RLDRAM2, and QDR2.

All of these elements are, of course, completely configurable by your engineers.

Configurability has always come at a price, with the trade-offs in low gate density, inadequate performance, or high power. Not any more. With over 7 million equivalent ASIC gates, speeds of up to 100 Gbps on chip, and the very latest in power management techniques, the Configurable Switch Array chip makes configurability worthwhile by making it uncompromisingly available.

For the very first time, the Configurable Switch Array chip brings the considerable advantages of high performance at low power to all kinds of networking applications. Using the latest advancements in power management—some of which were invented by our design team—the chip automatically reduces power by shutting off clocks to sectors not in use, and allows designers to vary chip voltage to achieve the optimum total power requirements. This holistic approach to power-managed performance represents a substantial and welcome breakthrough for equipment suppliers and customers alike.

Never before have so many resources been dedicated to your success.

Just like a normal FPGA, based on an array architecture ...

Xilinx-style Configurable Logic Blocks

Block RAM (four instances labeled)


Except some rows are specialized for network products ...

Packet Parser: Simple fast CPUs specialized for packet processing.

Specialized logic for computing packet checksums.

Content-addressable memory: “smart” memory for routing tables.
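
As a rough picture of what a content-addressable memory does for a routing table, the sketch below looks an address up by its content rather than by an index. Hardware compares every entry in parallel; the loop here is only a software stand-in. The longest-prefix-match policy, the table layout, and all names (`cam_lookup`, `ip`, `routes`, the port labels) are assumptions for illustration and are not taken from the cswitch material.

```python
from typing import List, Optional, Tuple

# Each entry: (prefix, prefix length in bits, next hop). Hypothetical layout.
Entry = Tuple[int, int, str]


def ip(a: int, b: int, c: int, d: int) -> int:
    """Pack four octets into a 32-bit integer address."""
    return (a << 24) | (b << 16) | (c << 8) | d


def cam_lookup(table: List[Entry], addr: int) -> Optional[str]:
    """Return the next hop of the longest prefix that matches addr."""
    best_len = -1
    best_hop = None
    for prefix, length, hop in table:
        # Mask keeps the top `length` bits of a 32-bit address.
        mask = (((1 << length) - 1) << (32 - length)) if length else 0
        if (addr & mask) == (prefix & mask) and length > best_len:
            best_len, best_hop = length, hop
    return best_hop


routes = [
    (ip(10, 0, 0, 0), 8, "port0"),
    (ip(10, 1, 0, 0), 16, "port1"),
    (0, 0, "default"),  # zero-length prefix matches every address
]

print(cam_lookup(routes, ip(10, 1, 2, 3)))
print(cam_lookup(routes, ip(192, 168, 0, 1)))
```

The software loop takes time proportional to the table size; the point of the specialized row is that the hardware does all the comparisons in one step.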


The I/O pins speak Ethernet and other network standards.

“SerDes”: Serializer-Deserializer logic.

Converts serial data of Ethernet to parallel bytes at the serial “line-rate”.

This “slow, wide” parallel representation lets the FPGA keep up with 1 Gbit Ethernet.

MAC: “Media Access Control” logic.

“Softwired” logic for the lowest layers of Ethernet -- can be configured for different standards.
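
The "slow, wide" idea is simple arithmetic: grouping a serial stream into N-bit words divides the clock rate the fabric must sustain by N. The sketch below shows that calculation and a bit-grouping loop. It ignores line-coding overhead, and the MSB-first bit order and the function names are illustrative assumptions, not a description of any particular SerDes.

```python
def parallel_clock_mhz(line_rate_mbps: float, width_bits: int) -> float:
    """Word rate needed to keep up with a serial line, ignoring coding overhead."""
    return line_rate_mbps / width_bits


def deserialize(bits, width=8):
    """Group a serial bit stream into width-bit words (MSB first, assumed).

    Trailing bits that do not fill a whole word are dropped.
    """
    words = []
    usable = len(bits) - len(bits) % width
    for i in range(0, usable, width):
        word = 0
        for b in bits[i:i + width]:
            word = (word << 1) | b
        words.append(word)
    return words


# 1 Gbit/s of payload delivered as bytes needs a 125 MHz byte clock.
print(parallel_clock_mhz(1000.0, 8))

stream = [1, 0, 1, 0, 0, 1, 0, 1,  1, 1, 1, 1, 0, 0, 0, 0]
print([hex(w) for w in deserialize(stream)])
```

A 125 MHz byte-wide datapath is within reach of FPGA logic, whereas a 1 GHz single-bit datapath would not be.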


Timing Conclusions

Flip-flop delay: setup and clk-to-Q

Logic delay: fan-out and wires

Critical path limits clock period

Xilinx timing: mapping logic into CLBs, routing onto fake wires.
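
The first three conclusions combine into the usual bound: the clock period must be at least clk-to-Q plus the slowest logic path plus setup time. A minimal sketch follows; the optional skew term is an added assumption (worst-case skew eating into the period), and the path names and delay values are hypothetical.

```python
def min_clock_period_ns(clk_to_q_ns, logic_ns, setup_ns, skew_ns=0.0):
    """Smallest clock period for one register-to-register path."""
    return clk_to_q_ns + logic_ns + setup_ns + skew_ns


def critical_path(paths_ns):
    """Return (name, delay) of the slowest combinational path."""
    name = max(paths_ns, key=paths_ns.get)
    return name, paths_ns[name]


# Hypothetical logic-plus-wire delays between flip-flops, in ns.
paths = {"alu": 8.0, "decode": 3.5, "bypass": 6.0}

name, logic = critical_path(paths)
period = min_clock_period_ns(clk_to_q_ns=0.5, logic_ns=logic,
                             setup_ns=0.5, skew_ns=1.0)
fmax_mhz = 1000.0 / period
print(name, period, fmax_mhz)
```

Only the slowest path sets the period, so speeding up "decode" or "bypass" in this example would not raise the clock rate at all.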


Where we are now, and what is next

We have a top-down view of how signals move through a processor in time

How to pipeline ...

Why pipeline processors? Performance!