CS 152 L5: Timing UC Regents Fall 2006 © UCB
2006-9-12 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 5 – Timing
www-inst.eecs.berkeley.edu/~cs152/
TAs: Udam Saini and Jue Sun
Last Time: Making a Test Plan ...
Which testing types are good for each epoch?
[Figure: timeline of four epochs (Epoch 1 to Epoch 4) separated by three milestones: processor assembly complete, correctly executes single instructions, correctly executes short programs. Testing types to place on the timeline: unit testing (early), multi-unit testing (later), processor testing with self-checks, diagnostics, complete processor testing, verification. Top-down testing and bottom-up testing are marked as the two directions.]
Idea: get confidence in “going to board” earlier ...
[Figure: the same four-epoch timeline (milestones: processor assembly complete, correctly executes single instructions, correctly executes short programs), with top-down testing (complete processor testing) and bottom-up testing (unit testing, multi-unit testing, processor testing with self-checks). Tool splits shown on the figure, in extraction order: ModelSim 20 % / Xilinx 80 %; ModelSim 80 % / Xilinx 20 %; ModelSim 20 % / Xilinx 80 %; ModelSim 20 % / Xilinx 80 %.]
Also: catch Synplicity “warnings and errors” earlier: “latch generated”, “combinational loop detected”, etc.
Last Time: Works in ModelSim, but ...
Human Strategies
▶ Group design effort
§ Everyone is clear on the specs
▶ Modular work effort
§ Divide the work between all your teammates
§ Avoid having 4 people working on 1 screen
▶ But help each other test
§ Fresh eyes catch different bugs
Teamwork lessons from previous semesters ...
Today: Determine minimum clock period
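[Figure: single-cycle datapath. A PC register (D, Q) clocked by Clk feeds the instruction memory (Addr, Data), which produces Instr. Two 32-bit adders compute PC + 0x4 and the branch target, selected by PCSrc. The register file RegFile has 5-bit ports rs1, rs2, ws, 32-bit ports rd1, rd2, wd, and a write enable WE. An extender (Ext), a 32-bit ALU, and a data memory (Addr, Din, Dout, WE) complete the datapath. Control lines from combinational logic: RegDest, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, RegWr, PCSrc; the Equal signal feeds the control logic.]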
CS 152 L06 Single Cycle 1 (6) UC Regents Fall 2004 © UCB
Step 1a: The MIPS-lite Subset for today
° ADD and SUB
• addU rd, rs, rt
• subU rd, rs, rt
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
° OR Immediate:
• ori rt, rs, imm16
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
° LOAD and STORE Word
• lw rt, rs, imm16
• sw rt, rs, imm16
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
° BRANCH:
• beq rs, rt, imm16
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
Today’s Lecture: Timing Analysis
Xilinx and delay
Clocked logic and delay
Combinational logic delay
View from 10,000 Feet
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb in-
structions. A simplified diagram of the processor pipeline is
Fig. 2. Microprocessor pipeline organization.
shown in Fig. 2, where the state boundaries are indicated by
gray. Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
Architects draw blocks ... Circuit designers draw transistors
(Embedded slide: Spring 2003 EECS150, Lec10-Timing)
Qualitative Analysis of Logic Delay
• Improved Transistor Model:
• We refer to transistor “strength” as the amount of current that flows for a given Vds and Vgs.
• The strength is linearly proportional to the ratio of W/L.
Logic is where they meet.
Architects reach logic top-down ...
[Figure: “Next State Combinational Logic” block with inputs R, Y, G, Change and Rst, and outputs next_R, next_Y, next_G.]
wire next_R, next_Y, next_G;
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
... using Verilog and schematics.
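The next-state equations on this slide can be checked in software. The following is a minimal Python sketch, not part of the lecture: it mirrors the three `assign` statements one for one. The function name `next_state` and the input sequence are illustrative assumptions.

```python
def next_state(r, y, g, rst, change):
    """Mirror the three Verilog assign statements for the R/Y/G state bits."""
    next_r = 1 if rst else (y if change else r)
    next_y = 0 if rst else (g if change else y)
    next_g = 0 if rst else (r if change else g)
    return next_r, next_y, next_g

# Apply reset once, then assert "change" three times and record each state.
state = next_state(0, 0, 0, rst=1, change=0)
trace = [state]
for _ in range(3):
    state = next_state(*state, rst=0, change=1)
    trace.append(state)

print(trace)  # each tuple is (R, Y, G)
```

With these equations, reset selects R, and each `change` moves the single active bit from R to G, from G to Y, and from Y back to R.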
EEs reach logic bottom-up ...
Can you build a processor entirely out of NAND gates?
Small number of high-performance
logic circuits.
For some definition of “small” and
“high-performance”
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz Lec3.33
Basic Components: CMOS Logic Gates
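NAND Gate: $Out = \overline{A \cdot B}$
A B | Out
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0
NOR Gate: $Out = \overline{A + B}$
A B | Out
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 0
[Figure: gate symbols and transistor-level CMOS circuits for the NAND and NOR gates, each with inputs A and B, output Out, and supply Vdd.]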
Basic Components: CMOS Logic Gates
4-input NAND Gate
[Figure: symbol and transistor-level CMOS circuit of a 4-input NAND gate with inputs A, B, C, D, output Out, and supply Vdd.]
More inputs: more asymmetric edge times!
Ideal versus Reality
° When input 0 -> 1, output 1 -> 0 but NOT instantly
• Output goes 1 -> 0: output voltage goes from Vdd (5 V) to 0 V
° When input 1 -> 0, output 0 -> 1 but NOT instantly
• Output goes 0 -> 1: output voltage goes from 0 V to Vdd (5 V)
° Voltage does not like to change instantaneously
[Figure: inverter (In, Out) with waveforms of Vin and Vout, voltage versus time, between 0 => GND and 1 => Vdd.]
Fluid Timing Model
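° Water ↔ Electrical Charge; Tank Capacity ↔ Capacitance (C)
° Water Level ↔ Voltage; Water Flow ↔ Charge Flowing (Current)
° Size of Pipes ↔ Strength of Transistors (G)
° Time to fill up the tank is proportional to C / G
[Figure: a reservoir at level Vdd, a tank (Cout) whose level is Vout, and a bottomless sea at sea level (GND). Switch SW1 connects the reservoir to the tank and switch SW2 connects the tank to the sea; the equivalent circuit shows Cout charged to Vout through SW1 and discharged through SW2.]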
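The claim that fill time is proportional to C / G can be tried numerically. The sketch below is not from the lecture: it assumes the tank fills like a first-order RC circuit, so that Vout(t) = Vdd (1 - exp(-t G / C)). That exponential form, the helper name `time_to_reach`, and the component values are all assumptions chosen for illustration.

```python
import math

def time_to_reach(fraction, c_farads, g_siemens):
    """Time for Vout to rise from 0 to fraction * Vdd in a first-order RC model."""
    tau = c_farads / g_siemens  # time constant C / G, in seconds
    return -tau * math.log(1.0 - fraction)

# Hypothetical values: 100 fF of load and a transistor conductance of 1 mS.
t_half = time_to_reach(0.5, 100e-15, 1e-3)
t_half_double_c = time_to_reach(0.5, 200e-15, 1e-3)
t_half_double_g = time_to_reach(0.5, 100e-15, 2e-3)

print(t_half, t_half_double_c / t_half, t_half_double_g / t_half)
```

Under this assumption, doubling the capacitance doubles the time to reach Vdd/2, and doubling the conductance halves it.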
Logic Synthesis often bridges the gap ...
Design Refinement
Informal System Requirement
Initial Specification
Intermediate Specification
Final Architectural Description
Intermediate Specification of Implementation
Final Internal Specification
Physical Implementation
Refinement: increasing level of detail
Logic Components
° Wires: Carry signals from one point to another
• Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
• Data goes in, results come out after some propagation delay
° Flip-Flops: Storage Elements
• After a clock edge, input copied to output
• Otherwise, the flip-flop holds its value
• Also: a “Latch” is a storage element that is level triggered
[Figure: a single-bit flip-flop (D, Q), an 8-bit register (D[8], Q[8]), and a combinational logic block with an 11-bit input and an 8-bit output.]
Elements of the design zoo
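Basic Combinational Elements + DeMorgan Equivalence
Wire: $Out = In$
In | Out
0 | 0
1 | 1
Inverter: $Out = \overline{In}$
In | Out
0 | 1
1 | 0
NAND Gate
A B | Out
0 0 | 1
0 1 | 1
1 0 | 1
1 1 | 0
NOR Gate
A B | Out
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 0
DeMorgan’s Theorem
NAND: $Out = \overline{A \cdot B} = \overline{A} + \overline{B}$
NOR: $Out = \overline{A + B} = \overline{A} \cdot \overline{B}$
OR gate with inverted inputs (equivalent to NAND)
A B | $\overline{A}$ $\overline{B}$ | Out
0 0 | 1 1 | 1
0 1 | 1 0 | 1
1 0 | 0 1 | 1
1 1 | 0 0 | 0
AND gate with inverted inputs (equivalent to NOR)
A B | $\overline{A}$ $\overline{B}$ | Out
0 0 | 1 1 | 1
0 1 | 1 0 | 0
1 0 | 0 1 | 0
1 1 | 0 0 | 0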
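The DeMorgan equivalences on this slide are small enough to verify exhaustively. The sketch below is an illustration rather than lecture material; the helper names `nand` and `nor` are assumptions, and bits are modeled as the Python integers 0 and 1.

```python
from itertools import product

def nand(a, b):
    """NAND of two bits."""
    return 1 - (a & b)

def nor(a, b):
    """NOR of two bits."""
    return 1 - (a | b)

# Enumerate inputs in truth-table order: (0,0), (0,1), (1,0), (1,1).
inputs = list(product((0, 1), repeat=2))
nand_table = [nand(a, b) for a, b in inputs]
nor_table = [nor(a, b) for a, b in inputs]

# DeMorgan: NAND equals OR of inverted inputs, NOR equals AND of inverted inputs.
demorgan_holds = all(
    nand(a, b) == ((1 - a) | (1 - b)) and nor(a, b) == ((1 - a) & (1 - b))
    for a, b in inputs
)

print(nand_table, nor_table, demorgan_holds)
```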
Still, in the highest performance
designs, human designers do (some) logic, circuits, and
layout by hand.
A Logic Circuit Primer
“Models should be as simple as possible, but no simpler ...” Albert Einstein.
Inverters: A simple transistor model
Delay Model: CMOS
Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• load factor of each input
• critical propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: a combinational logic cell with inputs A, B, ..., X driving a load Cout at Vout, and a plot of delay (Va -> Vout) versus Cout. The measured points lie on a straight line whose intercept is the internal delay and whose slope is the delay per unit load; Ccritical is marked on the Cout axis.]
Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
• NMOS (N-Type Metal Oxide Semiconductor) transistors
• PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
• Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
• Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
• Apply a HIGH (Vdd) to its gate: shuts off the conduction path
• Apply a LOW (GND) to its gate: turns the transistor into a “conductor”
[Figure: NMOS and PMOS transistor symbols, each shown with Vdd = 5 V and GND = 0 V applied to the gate.]
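Basic Components: CMOS Inverter
[Figure: inverter symbol (In, Out) and CMOS circuit, with a PMOS transistor between Vdd and Out and an NMOS transistor between Out and ground.]
° Inverter Operation
[Figure: two cases. In one, the NMOS is open and the PMOS charges Out to Vdd; in the other, the PMOS is open and the NMOS discharges Out. The slide annotates the inputs and outputs with “1” and “0”.]
pFET. A switch. “On” if gate is grounded.
nFET. A switch. “On” if gate is at Vdd.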
This model is too simple to be useful ...
Transistors as water valves
If electrons are water molecules, and a capacitor a bucket ...
An “on” p-FET fills up the capacitor with charge.
An “on” n-FET empties the bucket.
Gate Switching Behavior
• Inverter:
• NAND gate:
[Figure: two plots of water level versus time, one for each output transition between “0” and “1”.]
This model is often good enough ...
What is the bucket? A gate’s “fan-out”.
Driving other gates slows a gate down.
Driving wires slows a gate down.
“Fan-out”: The number of gate inputs driven by a gate’s output.
Why we call it “fan-out”
Gate Delay
• Fan-out:
• The delay of a gate is proportional to its output capacitance. With more load, gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)
[Figure: gate 1 driving the inputs of gates 2 and 3.]
A closer look at fan-out ...
Series Connection
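[Figure: two inverters G1 and G2 in series (Vin -> V1 -> Vout), with capacitance C1 on node V1 and Cout on Vout. The voltage-versus-time plot shows Vin, V1 and Vout swinging between GND and Vdd and crossing Vdd/2, with delay d1 from Vin to V1 and delay d2 from V1 to Vout.]
° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
• Capacitance of the wire connecting the two gates
• Input capacitance of the second inverter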
Calculating Aggregate Delays
[Figure: gate G1 drives node V1 (capacitance C1), which fans out to gate G2 (output V2) and gate G3 (output V3).]
° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
• Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
• Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
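The two rules on this slide (sum delays along a serial path, take the longest of the parallel paths) can be written down directly. The sketch below is illustrative only: the delay and capacitance values are hypothetical, and the names `path_delay` and `critical_path` are assumptions, not lecture material.

```python
def path_delay(stage_delays):
    """Delay of a serial path is the sum of its stage delays."""
    return sum(stage_delays)

def critical_path(paths):
    """Return (name, delay) of the longest path among the parallel paths."""
    name = max(paths, key=lambda n: path_delay(paths[n]))
    return name, path_delay(paths[name])

# Hypothetical stage delays in ns for the circuit with G1, G2 and G3.
d_vin_v1 = 0.30
d_v1_v2 = 0.20
d_v1_v3 = 0.45

paths = {
    "Vin->V2": [d_vin_v1, d_v1_v2],
    "Vin->V3": [d_vin_v1, d_v1_v3],
}
crit_name, crit_delay = critical_path(paths)

# Load on node V1: wire capacitance plus the input capacitance of both driven gates.
# All three values are hypothetical, in fF.
c_wire, cin_g2, cin_g3 = 20.0, 61.0, 61.0
c1 = c_wire + cin_g2 + cin_g3

print(crit_name, crit_delay, c1)
```

Both paths share the Vin -> V1 stage, so in this example the critical path is decided by which of G2 and G3 is slower.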
Characterize a Gate
° Input capacitance for each input
° For each input-to-output path:
• For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
- Internal delay (ns)
- Load dependent delay (ns / fF)
° Example: 2-input NAND Gate
For A and B: Input Load (I.L.) = 61 fF
For either A -> Out or B -> Out:
Tlh = 0.5 ns, Tlhf = 0.0021 ns / fF
Thl = 0.1 ns, Thlf = 0.0020 ns / fF
[Figure: 2-input NAND gate (A, B, Out) and a plot of delay A -> Out for Out: Low -> High versus Cout, a straight line with intercept 0.5 ns and slope 0.0021 ns / fF.]
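The linear delay model can be evaluated with the NAND numbers from this slide. In the sketch below, the internal delays and slopes are the slide's values; the load (two NAND inputs of 61 fF each, wire capacitance ignored) is a hypothetical choice, and the function name `gate_delay_ns` is an assumption.

```python
def gate_delay_ns(internal_ns, slope_ns_per_ff, load_ff):
    """Linear model: fixed internal delay plus load-dependent delay times load."""
    return internal_ns + slope_ns_per_ff * load_ff

# (internal delay in ns, load-dependent delay in ns/fF), taken from the slide.
NAND2 = {
    "lh": (0.5, 0.0021),  # output low -> high
    "hl": (0.1, 0.0020),  # output high -> low
}

# Hypothetical load: the output drives two NAND inputs of 61 fF each.
load_ff = 2 * 61.0

t_lh = gate_delay_ns(*NAND2["lh"], load_ff)
t_hl = gate_delay_ns(*NAND2["hl"], load_ff)

print(t_lh, t_hl)
```

The two transitions give different delays for the same load, which is why the slide characterizes each output transition type separately.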
A Specific Example: 2 to 1 MUX
Y = (A and !S) or (B and S)
[Figure: 2 x 1 Mux with inputs A, B, S and output Y, built from Gate 1, Gate 2 and Gate 3 connected by Wire 0, Wire 1 and Wire 2.]
° Input Load (I.L.)
• A, B: I.L. (NAND) = 61 fF
• S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load Dependent Delay (L.D.D.): Same as Gate 3
• TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
• TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
• TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
Linear model works for reasonable fan-out.
Driving more gates adds delay.
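The input-load bookkeeping for the mux is simple addition, sketched below with the slide's values. The scenario of Y driving the S input of a second, identical mux is a hypothetical example, and only the load-dependent part of the delay is computed because no internal delay for the whole mux is stated here. The names are illustrative assumptions.

```python
IL_NAND_FF = 61.0  # input load of a NAND input, from the slide
IL_INV_FF = 50.0   # input load of an inverter input, from the slide

# A and B each drive one NAND input; S drives the inverter and one NAND input.
mux_input_load_ff = {
    "A": IL_NAND_FF,
    "B": IL_NAND_FF,
    "S": IL_INV_FF + IL_NAND_FF,
}

# Load-dependent delay at Y is the same as Gate 3 (ns per fF), from the slide.
LDD_NS_PER_FF = {"lh": 0.0021, "hl": 0.0020}

def load_dependent_delay_ns(transition, load_ff):
    """Delay added by the load on Y, excluding any fixed internal delay."""
    return LDD_NS_PER_FF[transition] * load_ff

# Hypothetical: Y drives the S input of a second, identical mux.
extra_lh = load_dependent_delay_ns("lh", mux_input_load_ff["S"])
extra_hl = load_dependent_delay_ns("hl", mux_input_load_ff["S"])

print(mux_input_load_ff, extra_lh, extra_hl)
```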
Propagation delay graphs ...
Gate Delay
• Cascaded gates:
[Figure: cascaded gates with waveforms of Vin and Vout, annotated with the 1->0 transition.]
Intuition: Critical paths ...
Gate Delay
• “Fan-in”
• What is the delay in this circuit?
• Critical path: the path with the maximum delay, from any input to any output.
- In general, we include register set-up and clk-to-Q times in critical path calculation.
• Why do we care about the critical path?
x = g(a, b, c, d, e, f)
If d going 0-to-1 switches x 0-to-1, delay is T1.
If a going 0-to-1 switches x 0-to-1, delay is T2.
It would be surprising if T1 > T2.
T1
T2
T2 might be the critical (worst-case delay) path.
Why “might”? Wires have delay too ...
Wire Delay
• Even in those cases where the transmission line effect is negligible:
- Wires possess distributed resistance and capacitance
- Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
- Typically around half of C of gate load is in the wires.
• For long wires on ICs:
- busses, clock lines, global control signal, etc.
- Resistance is significant, therefore distributed RC effect dominates.
- signals are typically “rebuffered” to reduce delay:
[Figure: voltage versus time for nodes v1, v2, v3 and v4 along a wire.]
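The square-law dependence on length explains why rebuffering helps. The sketch below is a rough illustration under stated assumptions, not a model from the lecture: an unbuffered wire of length L is assumed to have delay k L², and splitting it into n equal segments is assumed to cost one buffer delay per added buffer. The constant `k`, the buffer delay and the wire length are hypothetical.

```python
def wire_delay(k, length, segments=1, t_buf=0.0):
    """Delay of a wire split into equal segments, each with delay k * (segment length)^2.

    One buffer delay t_buf is paid at each of the (segments - 1) split points.
    """
    seg_len = length / segments
    return segments * k * seg_len ** 2 + (segments - 1) * t_buf

# Hypothetical constants: k in ns per mm^2, length in mm, buffer delay in ns.
k = 0.05
length = 10.0
t_buf = 0.2

unbuffered = wire_delay(k, length)
buffered = wire_delay(k, length, segments=4, t_buf=t_buf)

print(unbuffered, buffered)
```

Under these assumptions the wire term falls by a factor of n when the wire is cut into n segments, so a few buffers can pay for themselves on a long wire; on a short wire the buffer delays would dominate.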
Looks benign, but ...
Clocked Logic Circuits
From Delay Models to Timing Analysis
Example
• Parallel to serial converter:
- $T \geq \text{time}(clk \to Q) + \text{time}(mux) + \text{time}(setup)$
$T \geq \tau_{clk \to Q} + \tau_{mux} + \tau_{setup}$
[Figure: parallel to serial converter clocked by clk.]
f | T
1 MHz | 1 μs
10 MHz | 100 ns
100 MHz | 10 ns
1 GHz | 1 ns
Timing Analysis: What is the smallest T that produces correct operation?
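The period constraint from the parallel to serial converter example and the f-to-T table can be combined in a short calculation. The sketch below is illustrative: the three component delays are hypothetical values, and the function names are assumptions. Only the sum of clk-to-Q, logic (mux) and setup delays and the relation T = 1 / f come from the slide.

```python
def min_period_ns(t_clk_to_q_ns, t_logic_ns, t_setup_ns):
    """Smallest clock period that satisfies T >= clk-to-Q + logic + setup."""
    return t_clk_to_q_ns + t_logic_ns + t_setup_ns

def period_ns(freq_hz):
    """Clock period in ns for a clock frequency in Hz."""
    return 1e9 / freq_hz

# Hypothetical delays in ns for the flip-flop and the mux.
t_min = min_period_ns(t_clk_to_q_ns=0.5, t_logic_ns=1.0, t_setup_ns=0.5)
f_max_mhz = 1e3 / t_min

# Reproduce the slide's table of f versus T.
table_ns = {f: period_ns(f) for f in (1e6, 10e6, 100e6, 1e9)}

print(t_min, f_max_mhz, table_ns)
```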
Timing Analysis and Logic Delay
If T > worst-case delay through CL, does this ensure correct operation?
gray. Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz Lec3.9
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth-table, logic equation, VHDL
  • Input load factor of each input
  • Propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ..., X driving output load Cout; plot of delay (Va -> Vout) against Cout, labeled with internal delay, delay per unit load, and Ccritical.]
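The linear delay model above can be sketched in a few lines. This is only an illustration: the function names and the numbers are made up, and it assumes that the load seen by an output is the sum of the input load factors of the cells it drives.

```python
def output_load(fanout_load_factors):
    """Total load on an output: sum of the input load factors it drives (assumed)."""
    return sum(fanout_load_factors)


def cell_delay(internal_delay_ps, delay_per_unit_load_ps, load):
    """Linear model: fixed internal delay + load-dependent delay x load."""
    return internal_delay_ps + delay_per_unit_load_ps * load


# Hypothetical cell: 50 ps internal delay, 10 ps per unit load,
# driving three inputs with load factors 1, 1, and 2.
load = output_load([1, 1, 2])
print(cell_delay(50, 10, load))
```

Because the model is linear in the load, the delay of a chain of cells is just the sum of the per-cell delays, which is what makes path delays easy to add up.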
Storage Element’s Timing Model
Clk
D Q
° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
• Output cannot change instantaneously at the trigger clock edge
• Similar to delay in logic gates, two components:
- Internal Clock-to-Q
- Load dependent Clock-to-Q
[Figure: timing diagram of D and Q. D is "don't care" outside the setup and hold windows around the clock edge; Q is unknown during the clock-to-Q interval.]
Clocking Methodology
[Figure: storage elements driven by a common clock (Clk) with combination logic between them.]
° All storage elements are clocked by the same clock edge
° The combination logic blocks:
  • Inputs are updated at each clock tick
  • All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
[Figure: storage elements driven by Clk with combination logic paths between them.]
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° Cycle time must be greater than:
  Clock-to-Q + Longest Path through Combination Logic + Setup
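The cycle-time rule above reduces to one line of arithmetic once the path delays are known. A minimal sketch, assuming all delays are given in picoseconds and using hypothetical numbers; the function name is not from the slides:

```python
def cycle_time_lower_bound(clk_to_q_ps, path_delays_ps, setup_ps):
    """Clock-to-Q + longest combinational path + setup.

    The cycle time must exceed this value for correct operation.
    """
    return clk_to_q_ps + max(path_delays_ps) + setup_ps


# Hypothetical register-to-register combinational path delays.
paths_ps = [1500, 3000, 2250]
print(cycle_time_lower_bound(200, paths_ps, 300))
```

Only the longest path matters here: shortening any of the other paths leaves the bound unchanged.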
Register:
An Array of Flip-Flops
Flip Flops have internal delays ...
[Figure: flip-flop with input D, output Q, and clock CLK.]
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
[Figure: timing diagram of D and Q marking t_setup and t_clk-to-Q.]
Where do Flip Flop delays come from? Wait for VLSI lectures.
Flip-Flop delays eat into “time budget”
Example: parallel to serial converter (from EECS150, Spring 2003, Lec10-Timing)
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τ_clk→Q + τ_mux + τ_setup
ALU “time budget”
General Model of Synchronous Circuit (from EECS150, Spring 2003, Lec10-Timing)
[Figure: registers (reg) and combinational logic blocks (CL) sharing a clock input; labels: input, output, option feedback.]
• In general, for correct operation:
    T ≥ time(clk→Q) + time(CL) + time(setup)
    T ≥ τ_clk→Q + τ_CL + τ_setup
  for all paths.
• How do we enumerate all paths?
  – Any circuit input or register output to any register input or circuit output.
  – “setup time” for circuit outputs depends on what it connects to.
  – “clk-Q time” for circuit inputs depends on from where it comes.
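One way to picture "enumerate all paths" is a depth-first walk over an acyclic graph of the combinational logic. The sketch below is illustrative only: the graph, the node names, and the delay values are hypothetical, and each edge is assumed to carry the delay added by stepping to the next node.

```python
def enumerate_paths(graph, sources, sinks):
    """Yield (path, delay) for every source-to-sink path in an acyclic delay graph."""
    def walk(node, path, delay):
        if node in sinks:
            yield path, delay
        for successor, edge_delay in graph.get(node, []):
            yield from walk(successor, path + [successor], delay + edge_delay)

    for source in sources:
        yield from walk(source, [source], 0)


def min_period_bound(graph, clk_to_q, setup):
    """Largest clk-to-Q + combinational delay + setup over all enumerated paths."""
    return max(
        clk_to_q[path[0]] + delay + setup[path[-1]]
        for path, delay in enumerate_paths(graph, clk_to_q, setup)
    )


# Hypothetical circuit: two register outputs feed gates g1 and g2, then regC.
GRAPH = {
    "regA.Q": [("g1", 300)],
    "regB.Q": [("g1", 300), ("g2", 200)],
    "g1": [("g2", 200)],
    "g2": [("regC.D", 0)],
}
CLK_TO_Q = {"regA.Q": 100, "regB.Q": 120}  # depends on where the path starts
SETUP = {"regC.D": 50}                     # depends on where the path ends

for path, delay in enumerate_paths(GRAPH, CLK_TO_Q, SETUP):
    print(" -> ".join(path), delay)
print(min_period_bound(GRAPH, CLK_TO_Q, SETUP))
```

Exhaustive enumeration grows quickly with circuit size, so this is a teaching sketch rather than how a production timing tool would be organized.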
Clock skew also eats into “time budget”
Clock Skew (cont.) (from EECS150, Spring 2003, Lec10-Timing)
• If clock period T = T_CL + T_setup + T_clk→Q, circuit will fail.
• Therefore:
  1. Control clock skew
     a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wires delay and buffer delay.
     b) don't “gate” clocks.
  2. T ≥ T_CL + T_setup + T_clk→Q + worst case skew.
• Most modern large high-performance chips (microprocessors) control end to end clock skew to a few tenths of a nanosecond.
[Figure: registers clocked by CLK and CLK' with combinational logic (CL) between them; clock skew is the delay in distribution.]
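The skew-adjusted bound can be written out directly. In this sketch the worst-case skew is taken, as a conservative assumption, to be the spread between the earliest and latest clock arrival times; the register names and all numbers are hypothetical.

```python
def worst_case_skew(clock_arrival_ps):
    """Spread between the latest and earliest clock arrival (a conservative bound)."""
    arrivals = list(clock_arrival_ps.values())
    return max(arrivals) - min(arrivals)


def min_period_with_skew(t_cl, t_setup, t_clk_to_q, skew):
    """T >= T_CL + T_setup + T_clk->Q + worst-case skew."""
    return t_cl + t_setup + t_clk_to_q + skew


# Hypothetical clock arrival times (ps) at three registers.
arrivals = {"r1": 0, "r2": 30, "r3": 10}
skew = worst_case_skew(arrivals)
print(min_period_with_skew(500, 50, 100, skew))
```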
Clock Skew (cont.)
• Note reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!
[Figure: registers clocked by CLK and CLK' with combinational logic (CL) between them; clock skew is the delay in distribution.]
As T → 0, which circuit fails first?
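Whether skew costs or buys time depends on which register sees the later clock. A hedged sketch of the signed version of the setup constraint, with hypothetical parameter names and values: a later clock at the capturing register adds to the effective period, while a later clock at the launching register subtracts from it.

```python
def min_period_signed_skew(t_clk_to_q, t_cl, t_setup,
                           launch_clk_delay, capture_clk_delay):
    """Minimum period when launch and capture clocks arrive with different delays.

    The capturing edge arrives (capture_clk_delay - launch_clk_delay) later than
    an ideal edge, so that difference is added to the time budget.
    """
    return t_clk_to_q + t_cl + t_setup + launch_clk_delay - capture_clk_delay


# Same logic, skew in each direction (all values in ps, hypothetical).
print(min_period_signed_skew(100, 500, 50, launch_clk_delay=30, capture_clk_delay=0))
print(min_period_signed_skew(100, 500, 50, launch_clk_delay=0, capture_clk_delay=30))
```

Delaying the capture clock also makes the hold requirement harder to meet, which may be part of why the slide calls this technique risky.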
Some Flip Flops have “hold” time ...
[Figure: timing diagram of D and CLK marking t_setup before and t_hold after the clock edge; D must stay stable here.]
[Figure: flip-flop (D, Q, CLK) with an inverter of delay t_inv.]
Does flip-flop hold time affect operation of this circuit? Under what conditions?
What is the intended function of this circuit?
For correct operation: t_clk-to-Q + t_inv > t_hold
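The hold condition above is a simple inequality to check. The sketch assumes the inverter sits on the path from Q back to D, so that D cannot change until t_clk-to-Q + t_inv after the clock edge; the function name and the numbers are hypothetical.

```python
def hold_time_met(t_clk_to_q, t_inv, t_hold):
    """True when D stays stable long enough: t_clk-to-Q + t_inv > t_hold."""
    return t_clk_to_q + t_inv > t_hold


# Hypothetical delays in picoseconds.
print(hold_time_met(t_clk_to_q=100, t_inv=50, t_hold=120))  # fast path is slow enough
print(hold_time_met(t_clk_to_q=40, t_inv=30, t_hold=120))   # D changes too early
```

Unlike the setup constraint, this check does not involve the clock period T, so slowing the clock down cannot fix a hold violation.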
Searching for processor critical path
Timing Analysis
What is the smallest T that produces correct operation?
Must consider all connected register pairs.
Why might I suspect this one?
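[Figure: single-cycle datapath. PC register (D, Q, clocked by Clk) with a +0x4 adder and a second adder selected by PCSrc; instruction memory (Addr, Data); register file RegFile (5-bit rs1, rs2, ws; 32-bit rd1, rd2, wd; WE); extender (Ext); 32-bit ALU with op input; data memory (Addr, Din, Dout, WE); Equal signal; and combinational control logic driving the control lines RegDest, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, RegWr, and PCSrc.]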
CS 152 L06 Single Cycle 1 (6) UC Regents Fall 2004 © UCB
Step 1a: The MIPS-lite Subset for today
° ADD and SUB
  • addU rd, rs, rt
  • subU rd, rs, rt
  Format:  op [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | shamt [10:6] | funct [5:0]
           6 bits     | 5 bits     | 5 bits     | 5 bits     | 5 bits       | 6 bits
° OR Immediate:
  • ori rt, rs, imm16
° LOAD and STORE Word
  • lw rt, rs, imm16
  • sw rt, rs, imm16
° BRANCH:
  • beq rs, rt, imm16
  Format (ori, lw, sw, beq):  op [31:26] | rs [25:21] | rt [20:16] | immediate [15:0]
                              6 bits     | 5 bits     | 5 bits     | 16 bits
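The field boundaries in the two instruction formats above map directly onto shifts and masks. A minimal sketch, with a hypothetical helper name and made-up instruction words, that extracts every field from a 32-bit word:

```python
def decode_fields(word):
    """Split a 32-bit instruction word into the fields of both formats."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31..26
        "rs": (word >> 21) & 0x1F,     # bits 25..21
        "rt": (word >> 16) & 0x1F,     # bits 20..16
        "rd": (word >> 11) & 0x1F,     # bits 15..11 (register format only)
        "shamt": (word >> 6) & 0x1F,   # bits 10..6  (register format only)
        "funct": word & 0x3F,          # bits 5..0   (register format only)
        "imm16": word & 0xFFFF,        # bits 15..0  (immediate format only)
    }


# Made-up immediate-format word: op=0x23, rs=4, rt=9, immediate=16.
word = (0x23 << 26) | (4 << 21) | (9 << 16) | 16
print(decode_fields(word))
```

The decoder returns all fields regardless of format; the control logic would decide from `op` which of them are meaningful.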
Real Stuff: Timing Estimation, Closure
Timing Estimation: Predicting a processor’s clock rate early in the project.
Timing Closure: Meeting (or exceeding!) the timing estimate.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.
[Flow diagram: core or chip wiring, extraction, Chipbench/EinsTimer, capacitance adjust, uplift analysis, noise impact on timing, non-uplift timing, Spice, GL/1, VIM, timer files, asserts, reports, and analysis/update (wires, buffers), with stage times of < 12 hr, < 24 hr, and < 48 hr.]
Notes:
• Executed 2–3 months prior to tape-out
• Fully extracted data from routed designs
• Hierarchical extraction
• Custom logic handled separately
  • Dracula
  • Harmony
• Extraction done for
  • Early
  • Late
Extracted units (flat or hierarchical), incrementally extracted RLMs, custom NDRs, VIMs
Figure 26: Histogram of the POWER4 processor path delays.
[Histogram: x-axis is timing slack (ps), from −40 to 280; y-axis is late-mode timing checks (thousands), from 0 to 200.]
Most wires have hundreds of picoseconds to spare.
The critical path
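A slack histogram like the one in Figure 26 can be sketched as follows. Slack is taken here as the cycle time minus the total path delay (assumed to already include clock-to-Q and setup), and the delays, cycle time, and bin width are hypothetical values, not POWER4 data.

```python
from collections import Counter


def slack_histogram(path_delays_ps, cycle_time_ps, bin_width_ps=20):
    """Count paths per slack bin; each key is the lower edge of a bin in ps."""
    bins = Counter()
    for delay in path_delays_ps:
        slack = cycle_time_ps - delay
        bins[(slack // bin_width_ps) * bin_width_ps] += 1
    return dict(sorted(bins.items()))


def critical_slack(path_delays_ps, cycle_time_ps):
    """Slack of the critical (slowest) path; negative means timing is not met."""
    return cycle_time_ps - max(path_delays_ps)


delays = [700, 760, 640, 500, 505]  # hypothetical path delays in ps
print(slack_histogram(delays, cycle_time_ps=750))
print(critical_slack(delays, cycle_time_ps=750))
```

Timing closure then amounts to working on the leftmost bins until no path has negative slack.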
Real Stuff: Floorplanning Intel XScale 80200
Administrivia: Upcoming deadlines ...
Friday 9/15: “ModelSim Checkoff”, in section, 125 Cory.
Monday 9/25: Lab 2 final report due via the submit program, 11:59 PM.
Friday 9/22: “Xilinx Checkoff”, in section, 125 Cory.
UC Regents Fall 2006 © UCBCS 152 L5: Timing
Office Hours, Mid-terms ...
Mid-term 1: Tuesday October 3th,6:00 to 9:00 PM, TBA.Mid-term 2: Tuesday December 5th,6:00 to 9:00 PM, TBA.
Card Key Woes? Go to the office you handed your form into and ask why. Let me know what they say ...
Udam: MW 6-7 PM, 125 Cory
Jue: TTh 3-4 PM, 125 Cory
John: TTh 10-11 AM, 315 Soda
Timing in Xilinx Designs
Spartan-3 FPGA Family: Introduction and Ordering Information
www.xilinx.com, DS099-1 (v1.3) July 13, 2004, Preliminary Product Specification

Table 3: Spartan-3 I/O Chart
Available User I/Os and Differential (Diff) I/O Pairs, shown as User/Diff for each package

Device    VQ100    TQ144    PQ208    FT256    FG320    FG456    FG676    FG900    FG1156
          VQG100   TQG144   PQG208   FTG256   FGG320   FGG456   FGG676   FGG900   FGG1156
XC3S50    63/29    97/46    124/56   -        -        -        -        -        -
XC3S200   63/29    97/46    141/62   173/76   -        -        -        -        -
XC3S400   -        97/46    141/62   173/76   221/100  264/116  -        -        -
XC3S1000  -        -        -        173/76   221/100  333/149  391/175  -        -
XC3S1500  -        -        -        -        221/100  333/149  487/221  -        -
XC3S2000  -        -        -        -        -        -        489/221  565/270  -
XC3S4000  -        -        -        -        -        -        -        633/300  712/312
XC3S5000  -        -        -        -        -        -        -        633/300  784/344

Notes: 1. All device options listed in a given package column are pin-compatible.

Package Marking (figure): callouts for Device Type, Package, Speed Grade, Temperature Range, Date Code, and Lot Code on an example XC3S50 PQ208 marking.
Prior Art for FPGAs ...
Xilinx: Large Logic Array + Block RAM
Spartan-3 FPGA Family: Introduction and Ordering Information
www.xilinx.com, DS099-1 (v1.3) July 13, 2004, Preliminary Product Specification

Architectural Overview
The Spartan-3 family architecture consists of five fundamental programmable functional elements:
• Configurable Logic Blocks (CLBs) contain RAM-based Look-Up Tables (LUTs) to implement logic and storage elements that can be used as flip-flops or latches. CLBs can be programmed to perform a wide variety of logical functions as well as to store data.
• Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal logic of the device. Each IOB supports bidirectional data flow plus 3-state operation. Twenty-four different signal standards, including seven high-performance differential standards, are available as shown in Table 2. Double Data-Rate (DDR) registers are included. The Digitally Controlled Impedance (DCI) feature provides automatic on-chip terminations, simplifying board designs.
• Block RAM provides data storage in the form of 18-Kbit dual-port blocks.
• Multiplier blocks accept two 18-bit binary numbers as inputs and calculate the product.
• Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for distributing, delaying, multiplying, dividing, and phase shifting clock signals.
These elements are organized as shown in Figure 1. A ring of IOBs surrounds a regular array of CLBs. The XC3S50 has a single column of block RAM embedded in the array. Those devices ranging from the XC3S200 to the XC3S2000 have two columns of block RAM. The XC3S4000 and XC3S5000 devices have four RAM columns. Each column is made up of several 18K-bit RAM blocks; each block is associated with a dedicated multiplier. The DCMs are positioned at the ends of the outer block RAM columns.
The Spartan-3 family features a rich network of traces and switches that interconnect all five functional elements, transmitting signals among them. Each functional element has an associated switch matrix that permits multiple connections to the routing.
Figure 1: Spartan-3 Family Architecture
Notes: 1. The two additional block RAM columns of the XC3S4000 and XC3S5000 devices are shown with dashed lines. The XC3S50 has only the block RAM column on the far left.
From: Xilinx Spartan 3 data sheet, modified to approximate Virtex architecture.
CLB == Configurable Logic Block, the “Swiss Army Knife” part
I/O Block (off-chip)
Blades in the CLB “Swiss Army Knife”
Virtex™-E 1.8 V Field Programmable Gate Arrays
Module 2 of 4, www.xilinx.com, DS022-2 (v2.4) July 17, 2002, Production Product Specification
Storage Elements
The storage elements in the Virtex-E slice can be configured either as edge-triggered D-type flip-flops or as level-sensitive latches. The D inputs can be driven either by the function generators within the slice or directly from slice inputs, bypassing the function generators.
In addition to Clock and Clock Enable signals, each Slice has synchronous set and reset signals (SR and BY). SR
Figure 4: 2-Slice Virtex-E CLB (each slice: two 4-input LUTs with inputs F1-F4 and G1-G4, carry and control logic with CIN/COUT, and two storage elements with outputs XQ and YQ)
Figure 5: Detailed View of Virtex-E Slice
Edge-triggered flip-flop
Adder carry chain, multiplier step, LUT expansion logic.
LUT box can also turn into RAM or a shift register chain
example g(F1, F2, F3, F4): F1 ^ F2 ^ F3 ^ F4
Look Up Table (LUT)
g()
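As a minimal sketch of the example above, a 4-input LUT can be modeled as a 16-entry table addressed by the input bits. Python is used only for illustration, and the names `build_lut` and `lut_read` are hypothetical, not Xilinx terminology; the bit ordering (F1 as the most significant address bit) is also an assumption.

```python
from itertools import product

def build_lut(g, n=4):
    """Tabulate an n-input Boolean function g into 2**n stored bits."""
    # product() counts 0000, 0001, ..., 1111 with the first input as the MSB.
    return [g(*bits) & 1 for bits in product((0, 1), repeat=n)]

def lut_read(table, *inputs):
    """Use the inputs as an address into the stored table (first input = MSB)."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]

# The slide's example: g(F1, F2, F3, F4) = F1 ^ F2 ^ F3 ^ F4
xor4 = build_lut(lambda f1, f2, f3, f4: f1 ^ f2 ^ f3 ^ f4)
```

Changing the lambda changes only the stored bits; the lookup hardware (here `lut_read`) stays the same, which is the point of the LUT.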
Inside a LUT ...
LUT as general logic gate
• An n-lut as a direct implementation of a function truth-table.
• Each latch location holds the value of the function corresponding to one input combination.

Example: 4-lut
INPUTS
0000   F(0,0,0,0)   store in 1st latch
0001   F(0,0,0,1)   store in 2nd latch
0010   F(0,0,1,0)
0011   F(0,0,1,1)
0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111

Example: 2-lut
INPUTS   AND   OR
11       1     1
10       0     1
01       0     1
00       0     0

Implements any function of 2 inputs.
How many of these are there?
How many functions of n inputs?
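One way to check the counting question: an n-input LUT stores 2^n bits, and each distinct bit pattern is a distinct function, so there are 2^(2^n) functions of n inputs. The sketch below enumerates them for small n; the function names and the row ordering are illustrative choices, not part of the slides.

```python
from itertools import product

def num_functions(n):
    """An n-input truth table has 2**n rows, each row holding 0 or 1."""
    return 2 ** (2 ** n)

def all_truth_tables(n):
    """Enumerate every possible output column for an n-input LUT."""
    return list(product((0, 1), repeat=2 ** n))

tables_2 = all_truth_tables(2)

# Output columns for input rows 00, 01, 10, 11 (row order is an assumption).
AND = (0, 0, 0, 1)
OR = (0, 1, 1, 1)
```

Under this count a 2-LUT covers 16 functions (AND and OR among them) and a 4-LUT covers 65,536.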
4-LUT Implementation
• n-bit LUT is implemented as a 2^n x 1 memory:
  - inputs choose one of 2^n memory locations.
  - memory locations (latches) are normally loaded with values from user's configuration bit stream.
  - Inputs to mux control are the CLB inputs.
• Result is a general purpose "logic gate".
  - n-LUT can implement any function of n inputs!
Figure labels: INPUTS, four latches (marked "FF" on the slide), 16 x 1 mux, OUTPUT. Latches programmed as part of configuration bit-stream.
Part of a FF “scan chain”
To next FF in chain ...
...
Set during configuration.
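The scan-chain idea can be sketched in a few lines: during configuration each FF takes the value of its upstream neighbor, so a serial bit stream fills the whole chain. This is a toy model under stated assumptions (the function names and the chain ordering are hypothetical; real configuration logic is more involved).

```python
def shift_in(chain, bit):
    """One configuration clock: the new bit enters FF 0, every other FF
    takes its upstream neighbor's old value, and the last value falls off."""
    return [bit] + chain[:-1]

def load_bitstream(num_ffs, bitstream):
    """Shift a serial bit stream into a chain of num_ffs flip-flops."""
    chain = [0] * num_ffs
    for bit in bitstream:
        chain = shift_in(chain, bit)
    return chain

# The first bit sent ends up in the last FF, so to leave a 2-input AND
# table (rows 00, 01, 10, 11 -> 0, 0, 0, 1) in the chain, send it reversed.
and_table = [0, 0, 0, 1]
loaded = load_bitstream(4, list(reversed(and_table)))
```

After the last shift the chain holds the AND table, and the same four FFs could be reloaded with any other 2-input function.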
Key things to remember ...
How to learn what to do: read the Synplicity and Xilinx documentation, try small examples, look at CAD tool log files and output, ask the TAs.
The way you structure your design (and your Verilog) can make logic mapping “better” (denser, faster).
CAD tools choose mapping from Verilog to CLB resources.
After routing ...
Xilinx: Large Array of CLBs, plus RAM
plus wires
From: Xilinx Spartan 3 data sheet, simplified.
Why Xilinx wires are so slow ...
Wires are slow because (1) each green dot is a transistor switch, (2) path may not be shortest length, (3) all wires are too long!
The best Xilinx users “write Verilog to the grid”. When Xilinx designs FPGA chips, wiring channels are optimized for (2) & (3).
Connect this
To this
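To see why a path through many switch points is slow, one common first-order model (an assumption here, not something taken from the slides) treats each switch as a series resistance followed by a segment capacitance and applies the Elmore delay. For N identical segments the delay grows as R*C*N*(N+1)/2, roughly quadratically in the number of switches. Function names and unit values below are hypothetical.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay to the far end of an RC ladder.

    Each node's capacitance is charged through all resistance upstream of it,
    so the delay is the sum over nodes of (upstream resistance * capacitance).
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r
        delay += upstream_r * c
    return delay

def switched_wire_delay(num_switches, r_switch, c_segment):
    """Path made of identical (switch resistance, wire segment) pairs."""
    return elmore_delay([r_switch] * num_switches, [c_segment] * num_switches)

# With R = C = 1 (arbitrary units), 4 switches cost 1 + 2 + 3 + 4 = 10 units.
delay_4 = switched_wire_delay(4, 1.0, 1.0)
```

In this model, doubling the number of switches in a path more than doubles its delay, which is one reason a shorter route through the grid pays off.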
What are the green dots?
User Programmability
• Latches are used to:
  1. make or break cross-point connections in the interconnect
  2. define the function of the logic blocks
  3. set user options:
     - within the logic blocks
     - in the input/output blocks
     - global reset/clock
• "Configuration bit stream" can be loaded under user control:
  - All latches are strung together in a shift chain:
• Latch-based (Xilinx, Altera, …)
  + reconfigurable
  - volatile
  - relatively large.
Figure labels: latch (marked "FF" on the slide), “cross-point connection”
FPGA Variations
• Families of FPGA's differ in:
  - physical means of implementing user programmability,
  - arrangement of interconnection wires, and
  - the basic functionality of the logic blocks.
• Most significant difference is in the method for providing flexible blocks and connections:
• Anti-fuse based (ex: Actel)
  + Non-volatile, relatively small
  - fixed (non-reprogrammable)
Set during configuration.
One flip-flop and a pass gate for each switch point. In order to have enough wires in the channels to wire up CLBs for most circuits, we need a lot of switch points! Thus, “80%+ of FPGA is for wiring”.
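A toy model of one wiring crossing makes the cost concrete: every potential connection needs its own configuration flip-flop, whether or not the design uses it. The class below is a sketch under stated assumptions (a fully populated crossing, hypothetical names and track counts); real switch matrices are only partially populated.

```python
class SwitchMatrix:
    """Crossing of horizontal and vertical tracks with one switch per pair."""

    def __init__(self, h_tracks, v_tracks):
        self.h_tracks = h_tracks
        self.v_tracks = v_tracks
        # One configuration flip-flop per switch point, all off after reset.
        self.ff = [[0] * v_tracks for _ in range(h_tracks)]

    def connect(self, h, v):
        """Set the FF whose output turns on the pass gate at (h, v)."""
        self.ff[h][v] = 1

    def is_connected(self, h, v):
        return self.ff[h][v] == 1

    def num_config_bits(self):
        """FFs are paid for at every switch point, used or not."""
        return self.h_tracks * self.v_tracks

    def num_used(self):
        return sum(sum(row) for row in self.ff)

sm = SwitchMatrix(8, 8)   # hypothetical 8 x 8 crossing
sm.connect(2, 5)          # route horizontal track 2 onto vertical track 5
```

Here a single connection uses 1 of 64 flip-flops, while a 4-input LUT needs only 16 bits in total, which hints at why wiring can dominate the area.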
Clocks have dedicated wires (low skew)
Spartan-3 FPGA Family: Functional Description
www.xilinx.com, DS099-2 (v1.3) August 24, 2004, Preliminary Product Specification
width of the die. In turn, the horizontal spine branches out into a subsidiary clock interconnect that accesses the CLBs.
2. The clock input of either DCM on the same side of the die (top or bottom) as the BUFGMUX element in use.
A Global clock input is placed in a design using either a BUFGMUX element or the BUFG (Global Clock Buffer) element. For the purpose of minimizing the dynamic power dissipation of the clock network, the Xilinx development software automatically disables all clock line segments that a design does not use.
Figure 18: Spartan-3 Clock Network (Top View). Labels: global clock inputs GCLK0 to GCLK7, two groups of 4 BUFGMUX elements, four DCMs, Top Spine, Bottom Spine, Horizontal Spine.
From: Xilinx Spartan 3 data sheet. Virtex is similar.
Die photo: Xilinx Virtex
Gold wires are the clock tree.
the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15 200 clock pins on the chip, regardless of proximity.
From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.
Design methodology and results
This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed
Figure 6: Schematic diagram of global clock generation and distribution. Labels: PLL, Bypass, Reference clock in, Reference clock out, Clock distribution, Clock out.
Figure 7: 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines. Labels: Grid, Tuned sector trees, Sector buffers, Buffer level 2, Buffer level 1; axes x, y, Delay.
Figure 8: Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale. Labels: Multiple-fingered transmission line; axes x, y, Delay.
J. D. Warnock et al., IBM J. Res. & Dev., Vol. 46, No. 1, January 2002
Clock Tree Delays, IBM “Power” CPU
Clock Tree Delays, IBM Power
clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid. At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved to guarantee feasibility and speed the design process.
Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a “correct-by-design” clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.
Circuit design
The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic, an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.
The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology, including uncertainties associated with the modeling of the floating-body effect [21–23] and its impact on noise immunity [22, 24–27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.
Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.
The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description
Figure 9: Global clock waveforms showing 20 ps of measured skew. Axes: Volts (V), 0.0 to 1.5, versus Time (ps), 0 to 2500; annotation: 20 ps skew.
J. D. Warnock et al., IBM J. Res. & Dev., Vol. 46, No. 1, January 2002
Key things to remember about FPGAs ...
Calinx Xilinx chip is large but not extremely large: 38,400 LUT + FF + adder carry chain, 655 kb block RAM.
Normal designs: critical path 80% wire delay, 20% LUT delays. The best designers can flip these percentages.
Xilinx wires are fake. The cross-points in the path make the wires slow.
Tools: Global timing constraints, region locking.
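The 80/20 split can be made concrete with the usual register-to-register timing budget: the clock period must cover clock-to-Q delay, the LUT and wire delays along the critical path, the setup time, and any clock skew. All delay values below are hypothetical round numbers in nanoseconds chosen for illustration, not measurements of any Xilinx part.

```python
def min_clock_period(clk_to_q, lut_delays, wire_delays, setup, skew=0.0):
    """Smallest period that lets data launched at one FF reach the next in time."""
    return clk_to_q + sum(lut_delays) + sum(wire_delays) + setup + skew

def wire_fraction(lut_delays, wire_delays):
    """Share of the combinational path delay spent in wiring."""
    total = sum(lut_delays) + sum(wire_delays)
    return sum(wire_delays) / total

luts = [0.5, 0.5, 0.5, 0.5]              # four LUT levels (assumed values)
normal_wires = [2.0, 2.0, 2.0, 2.0]      # "normal" design: 80% wire delay
tuned_wires = [0.125, 0.125, 0.125, 0.125]  # percentages flipped: 20% wire

t_normal = min_clock_period(0.5, luts, normal_wires, 0.5)
t_tuned = min_clock_period(0.5, luts, tuned_wires, 0.5)
```

With these assumed numbers the same logic goes from an 11.0 ns period to a 3.5 ns period purely by shortening the wiring, which is the payoff the global timing constraints and region locking tools are aiming at.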
The analogy holds up ...
CLBs are “real” elements, with real physics. Not a simulation of physics.
Configurability has a price: lower performance, wasted resources.
≃
Cisco Systems products often use FPGAs
How can FPGAs be improved to work better for a particular type of product?
Chart: cost axis from $15M to $45M; categories: Masks/Wafers, Test & Engineering, Software, Design/Verification & Layout.
The days are long gone when you could spend a few hundred thousand dollars and six months developing an ASIC, drop it into a piece of equipment, then sit back and watch it sell for years. Development costs have spiraled into the tens of millions. Development times are stretching into double-digit months. And the ASICs themselves have become so complicated, half of them have to go back to be re-spun, while another 30% have to take a second or even third trip back to the drawing board, adding another three months—minimum—to the schedule.
That would be bad enough, except that rapid time to market is more valuable than ever. And more expensive than ever should you fail to attain it. In a recent speech, no less an authority than John Chambers of Cisco observed that every four-week delay in product availability cost his company 14% market share.
Four weeks.
Considering the pace of these markets, can you really afford to build something really expensive, that you can’t change, and that isn’t going to be finished for two years?
Those are your unattractive choices, if you choose to go the ASIC route. And that’s the good news. The bad news is, it’s only going to get worse.
Increased complexity inevitably leads to increased costs. And in the evolving networking, telecom, wireless, and storage markets, complexity is always going to increase. The ASIC manufacturers’ answer to dealing with this increased complexity is to reduce their geometries. Theoretically, that’ll reduce costs.
But it doesn’t. In fact, it does just the opposite.
Risk is no longer an option. Or a necessity.
Example: cswitch, an FPGA startup.
For a custom chip to deliver high performance in all types of networks, it has to move, store, and edit packets at very high speeds. The Configurable Switch Array chip does just that. It’s the first configurable solution to deliver bandwidth at 40 to 100 Gbps for a range of applications, making it capable of moving up to 6 TBps of packets at speeds of up to 2 GHz. To handle editing tasks, the chip packs a rich assortment of Frame Header Parsers, Arithmetic Units, and RCAMs to edit and classify packets at 1 GHz speeds. For storing packets, the chip includes over 18 Mb of on-chip memory, as well as support for the latest high-speed memories, such as DDR2, RLDRAM2, and QDR2.
All of these elements are, of course, completely configurable by your engineers.
Configurability has always come at a price, with the trade-offs in low gate density, inadequate performance, or high power. Not any more. With over 7 million equivalent ASIC gates, speeds of up to 100 Gbps on chip, and the very latest in power management techniques, the Configurable Switch Array chip makes configurability worthwhile by making it uncompromisingly available.
For the very first time, the Configurable Switch Array chip brings the considerable advantages of high performance at low power to all kinds of networking applications. Using the latest advancements in power management—some of which were invented by our design team—the chip automatically reduces power by shutting off clocks to sectors not in use, and allows designers to vary chip voltage to achieve the optimum total power requirements. This holistic approach to power-managed performance represents a substantial and welcome breakthrough for equipment suppliers and customers alike.
Never before have so many resources been dedicated to your success.
Just like a normal FPGA, based on an array architecture ...
Xilinx-style Configurable Logic Blocks
Block RAM (×4)
Except some rows are specialized for network products ...
Packet Parser: Simple fast CPUs specialized for packet processing.
Specialized logic for computing packet checksums.
Content-addressable memory: “smart” memory for routing tables.
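As a rough software picture of what "smart" memory means here: an ordinary RAM maps an address to its contents, while a content-addressable memory maps contents back to the address where they are stored. The toy model below is a sketch under that assumption only. The class name `ToyCAM`, the table size, and the sample addresses are hypothetical, and it does exact matching, whereas real routing hardware typically needs prefix matching.

```python
# Toy software model of a content-addressable memory (CAM).
# ToyCAM, the 4-entry size, and the sample addresses are hypothetical.

class ToyCAM:
    def __init__(self, num_entries):
        self.keys = [None] * num_entries   # stored content, one per slot

    def write(self, index, key):
        self.keys[index] = key

    def search(self, key):
        """Return the index of the first slot holding key, or None.

        Hardware compares every slot in parallel in a single lookup;
        this loop models only the result, not the speed.
        """
        for index, stored in enumerate(self.keys):
            if stored == key:
                return index
        return None

cam = ToyCAM(4)
cam.write(0, "10.0.0.1")
cam.write(2, "192.168.1.7")

# A conventional RAM, indexed by the CAM match address, holds the output port.
next_hop_port = [3, None, 1, None]

slot = cam.search("192.168.1.7")
print(slot, next_hop_port[slot])   # 2 1
print(cam.search("8.8.8.8"))       # None (no matching entry)
```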
The I/O pins speak Ethernet and other network standards.
“SerDes”: Serializer-Deserializer logic.
Converts Ethernet’s serial data to parallel bytes at the serial “line rate”.
This “slow, wide” parallel representation lets the FPGA keep up with 1 Gbit Ethernet.
MAC: “Media Access Control” logic.
“Softwired” logic for the lowest layers of Ethernet -- can be configured for different standards.
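The "slow, wide" idea behind the SerDes can be sketched with a little arithmetic. The example assumes a raw 1 Gbit/s serial stream and an 8-bit parallel bus, ignores any line coding overhead, and picks first-bit-as-MSB ordering purely for illustration; none of these details come from the slide.

```python
# Sketch of the serial-to-parallel trade-off. The 8-bit width and the
# MSB-first bit order are assumptions made for illustration.

def parallel_clock_hz(line_rate_bps, width_bits):
    """Clock needed on the parallel side to keep up with the serial line."""
    return line_rate_bps / width_bits

def deserialize(bits, width_bits=8):
    """Group a serial bit sequence into parallel words, first bit as MSB."""
    words = []
    for start in range(0, len(bits) - width_bits + 1, width_bits):
        word = 0
        for bit in bits[start:start + width_bits]:
            word = (word << 1) | bit
        words.append(word)
    return words

serial_bits = [1, 0, 1, 0, 0, 1, 0, 1,   # 0xA5
               0, 0, 0, 0, 1, 1, 1, 1]   # 0x0F

print(parallel_clock_hz(1e9, 8) / 1e6)   # 125.0 (MHz)
print(deserialize(serial_bits))          # [165, 15]
```

With these assumptions the FPGA fabric only has to run at 125 MHz to process bytes arriving from a 1 Gbit/s link.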
Timing Conclusions
Flip-flop delay: setup and clk-to-Q
Logic delay: fan-out and wires
Critical path limits clock period
Xilinx timing: mapping logic into CLBs, routing onto fake wires.
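The conclusions above combine into the usual register-to-register constraint: the clock period must cover clk-to-Q, the slowest logic-plus-wire path, and setup. The sketch below uses made-up path names and delays, and it leaves out clock skew for simplicity.

```python
# Minimal sketch of the minimum-clock-period calculation.
# All path names and delay values are hypothetical; clock skew is ignored.

def min_clock_period_ns(t_clk_to_q_ns, path_delays_ns, t_setup_ns):
    """Return (critical path name, minimum clock period in ns)."""
    critical = max(path_delays_ns, key=path_delays_ns.get)
    period = t_clk_to_q_ns + path_delays_ns[critical] + t_setup_ns
    return critical, period

# Combinational (logic + wire) delay of each register-to-register path.
paths_ns = {
    "pc_increment": 2.0,
    "alu_to_regfile": 6.5,
    "branch_compare": 4.0,
}

critical, period_ns = min_clock_period_ns(0.5, paths_ns, 0.5)
print(f"critical path: {critical}")
print(f"min period: {period_ns:.1f} ns, max clock: {1000.0 / period_ns:.1f} MHz")
```

Only the slowest path matters: speeding up `pc_increment` in this example would not change the clock period at all.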
Where we are now, and what is next
We have a top-down view of how signals move through a processor in time
How to pipeline ...
Why pipeline processors? Performance!