EE141
EECS151/251ASpring2020 DigitalDesignandIntegratedCircuitsInstructor:JohnWawrzynek
Lecture 19:Clock and Power Distribution
EE141
Outline Clock non-idealities Clock Distribution Power Distribution
2
EE141
Synchronous Timing - Review
EE141
Synchronous Timing
4
EE141
Register Timing Parameters
D
Clk
Q
D
Q
Clk
tclk-q
thold
T
tsetup
Output delays can be different for rising and falling data transitions
5
EE141
Timing Constraints
tclk-q.maxtclk-q.min
tlogic,maxtlogic,min
tsetup, thold
6
EE141
Cycle time (max): TClk > tclk-q,max + tlogic,max + tsetup
Race margin (min): thold < tclk-q,min + tlogic,min
Timing Constraints
tclk-q,maxtclk-q,min
tlogic,maxtlogic,min
tsetup, thold
7
EE141
Clock Nonidealities
EE141
Clock Nonidealities Clock skew: tSK Time difference between the sink (receiving) and
source (launching) clock edge; deterministic + random
Clock jitter Temporal variations in consecutive edges of the
clock signal; modulation + random noise Cycle-to-cycle (short-term) tJS
Long term tJL
Variation of the pulse width Important for level sensitive clocking
9
EE141
Clock UncertaintiesSources of clock uncertainty
10
EE141
Clock Skew and Jitter
Both skew and jitter affect the effective cycle time and the race margin
Clk
Clk’
tSK
tJS
11
EE141
Positive Skew
Launching edge arrives before the receiving edge 12
EE141
Negative Skew
Receiving edge arrives before the launching edge
CLK1
CLK2
TCLK
δ
TCLK - δ
2
1
4
3
13
EE141
Timing Constraints
14
Minimum cycle time: Tclk + δ = tclk-q,max + tsetup + tlogic,max
tclk-q,maxtclk-q,min
tlogic,maxtlogic,mintsetup, thold
Skew may be negative or positive
EE141
Timing Constraints
15
tclk-q,maxtclk-q,min
tlogic,maxtlogic,mintsetup, thold
Hold time constraint: t(clk-q,min) + t(logic,min) > thold + δ
Skew may be negative or positive
EE141
Jitter Contributes to Critical Path
16
Clk2tsetup
Latest point of launching
Earliest arrivalof next cycle
tJS2 - δ
Clk1
TCLK
tclk-Q,maxtlogic,max
tJS1
EE141
Clock Constraints in Edge-Triggered Systems
17
If launching edge is late and receiving edge is early, the data will not be too late if:
Minimum cycle time is determined by the maximum delays through the logic
tclk-q,max + tlogic,max + tsetup < TCLK – tJS,1 – tJS,2 + δ
tclk-q,max + tlogic,max + tsetup - δ + 2tJS < TCLK
Skew can be either positive or negative Jitter tJS usually expressed as peak-to-peak or n x RMS value
EE141
Datapath with Feedback
18
EE141
Clock Distribution
EE141
Clock Distribution
!20
Single clock generally used to synchronize all logic on the same chip (or region of chip) Need to distribute clock over the entire region While maintaining low skew/jitter And without burning too much power
EE141 21
EE141 22
EE141
Clock Distribution
23
What’s wrong with just routing wires to every point that needs a clock?
EE141
H-Tree
24
Equal wire length/number of buffers to get to every location
EE141
More realistic ASIC H-tree
25
[Restle98]
EE141
Clock Grid Alternative
Driver
Driver
Driver D
river
GCLK GCLK
GCLK
GCLK
•No RC matching •But huge power
26
EE141
Example: DEC Alpha 21164 (1995)
2 phase single wire clock, distributed globally
2 distributed drivers Reduced RC delay/skew Improved thermal distribution 3.75nF clock load, 20W power 58 cm final driver width
Local inverters for latching Conditional clocks in caches to
reduce power
trise = 350ps tskew = 150ps
tcycle= 3.3ns
Clock waveform
Location of clock driver on die
pre-driver
final drivers
27
EE141
28
Clock Drivers
EE141
Clock Skew in Alpha Processor
29
EE141
2 Phase, with multiple conditional buffered clocks 2.8 nF clock load 40 cm final driver width
Local clocks can be gated “off” to save power Reduced load/skew Reduced thermal issues Multiple clocks complicate race checking
trise = 350ps tskew = 50ps
tcycle= 1.67ns
EV6 (Alpha 21264) Clocking 600 MHz – 0.35 micron CMOS
Global clock waveform
30
EE141
21264 Clocking
31
EE141 32
EV6 Clock Results
GCLK Skew (at Vdd/2 Crossings)
ps 5
10 15 20 25 30 35 40 45 50
ps 300 305 310 315 320 325 330 335 340 345
GCLK Rise Times (20% to 80% Extrapolated to 0% to 100%)
UC Regents Fall 2013 © UCBCS 250 L3: Timing
the total wire delay is similar to the total buffer delay. Apatented tuning algorithm [1 6 ] was required to tune themore than 2 0 0 0 tunable transmission lines in these sectortrees to achieve low skew, visualized as the flatness of thegrid in the 3 D visualizations. Figure 8 visualizes four ofthe 6 4 sector trees containing about 1 2 5 tuned wiresdriving 1 /1 6 th of the clock grid. While symmetric H-treeswere desired, silicon and wiring blockages often forcedmore complex tree structures, as shown. Figure 8 alsoshows how the longer wires are split into multiple-fingeredtransmission lines interspersed with Vdd and ground shields(not shown) for better inductance control [1 7 , 1 8 ]. Thisstrategy of tunable trees driving a single grid results in lowskew among any of the 1 5 2 0 0 clock pins on the chip,regardless of proximity.
From the global clock grid, a hierarchy of short clockroutes completed the connection from the grid down tothe individual local clock buffer inputs in the macros.These clock routing segments included wires at the macrolevel from the macro clock pins to the input of the localclock buffer, wires at the unit level from the macro clockpins to the unit clock pins, and wires at the chip levelfrom the unit clock pins to the clock grid.
Design methodology and resultsThis clock-distribution design method allows a highlyproductive combination of top-down and bottom-up designperspectives, proceeding in parallel and meeting at thesingle clock grid, which is designed very early. The treesdriving the grid are designed top-down, with the maximumwire widths contracted for them. Once the contract for thegrid had been determined, designers were insulated fromchanges to the grid, allowing necessary adjustments to thegrid to be made for minimizing clock skew even at a verylate stage in the design process. The macro, unit, and chipclock wiring proceeded bottom-up, with point tools ateach hierarchical level (e.g., macro, unit, core, and chip)using contracted wiring to form each segment of the totalclock wiring. At the macro level, short clock routesconnected the macro clock pins to the local clock buffers.These wires were kept very short, and duplication ofexisting higher-level clock routes was avoided by allowingthe use of multiple clock pins. At the unit level, clockrouting was handled by a special tool, which connected themacro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed
Figure 6
Schematic diagram of global clock generation and distribution.
PLL
Bypass
Referenceclock in
Referenceclock out
Clock distributionClock out
Figure 7
3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.
Del
ayGrid
Tunedsectortrees
Sectorbuffers
Buffer level 2
Buffer level 1
y
x
Figure 8
Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.
Del
ay Multiple-fingeredtransmissionline
yx
J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2
32
Clock Tree Delays,
IBM “Power” CPU
Del
ay
33
UC Regents Fall 2013 © UCBCS 250 L3: Timing
the total wire delay is similar to the total buffer delay. Apatented tuning algorithm [1 6 ] was required to tune themore than 2 0 0 0 tunable transmission lines in these sectortrees to achieve low skew, visualized as the flatness of thegrid in the 3 D visualizations. Figure 8 visualizes four ofthe 6 4 sector trees containing about 1 2 5 tuned wiresdriving 1 /1 6 th of the clock grid. While symmetric H-treeswere desired, silicon and wiring blockages often forcedmore complex tree structures, as shown. Figure 8 alsoshows how the longer wires are split into multiple-fingeredtransmission lines interspersed with Vdd and ground shields(not shown) for better inductance control [1 7 , 1 8 ]. Thisstrategy of tunable trees driving a single grid results in lowskew among any of the 1 5 2 0 0 clock pins on the chip,regardless of proximity.
From the global clock grid, a hierarchy of short clockroutes completed the connection from the grid down tothe individual local clock buffer inputs in the macros.These clock routing segments included wires at the macrolevel from the macro clock pins to the input of the localclock buffer, wires at the unit level from the macro clockpins to the unit clock pins, and wires at the chip levelfrom the unit clock pins to the clock grid.
Design methodology and resultsThis clock-distribution design method allows a highlyproductive combination of top-down and bottom-up designperspectives, proceeding in parallel and meeting at thesingle clock grid, which is designed very early. The treesdriving the grid are designed top-down, with the maximumwire widths contracted for them. Once the contract for thegrid had been determined, designers were insulated fromchanges to the grid, allowing necessary adjustments to thegrid to be made for minimizing clock skew even at a verylate stage in the design process. The macro, unit, and chipclock wiring proceeded bottom-up, with point tools ateach hierarchical level (e.g., macro, unit, core, and chip)using contracted wiring to form each segment of the totalclock wiring. At the macro level, short clock routesconnected the macro clock pins to the local clock buffers.These wires were kept very short, and duplication ofexisting higher-level clock routes was avoided by allowingthe use of multiple clock pins. At the unit level, clockrouting was handled by a special tool, which connected themacro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed
Figure 6
Schematic diagram of global clock generation and distribution.
PLL
Bypass
Referenceclock in
Referenceclock out
Clock distributionClock out
Figure 7
3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.
Del
ay
Grid
Tunedsectortrees
Sectorbuffers
Buffer level 2
Buffer level 1
y
x
Figure 8
Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.
Del
ay Multiple-fingeredtransmissionline
yx
J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2
32
Clock Tree Delays, IBM Power
clock grid was completed with a tool run at the chip level,connecting unit-level pins to the grid. At this point, theclock tuning and the bottom-up clock routing process stillhave a great deal of flexibility to respond rapidly to evenlate changes. Repeated practice routing and tuning wereperformed by a small, focused global clock team as theclock pins and buffer placements evolved to guaranteefeasibility and speed the design process.
Measurements of jitter and skew can be carried outusing the I/Os on the chip. In addition, approximately 1 0 0top-metal probe pads were included for direct probingof the global clock grid and buffers. Results on actualPOWER4 microprocessor chips show long-distanceskews ranging from 2 0 ps to 4 0 ps (cf. Figure 9). This isimproved from early test-chip hardware, which showedas much as 7 0 ps skew from across-chip channel-lengthvariations [1 9 ]. Detailed waveforms at the input andoutput of each global clock buffer were also measuredand compared with simulation to verify the specializedmodeling used to design the clock grid. Good agreementwas found. Thus, we have achieved a “correct-by-design”clock-distribution methodology. It is based on our designexperience and measurements from a series of increasinglyfast, complex server microprocessors. This method resultsin a high-quality global clock without having to usefeedback or adjustment circuitry to control skews.
Circuit designThe cycle-time target for the processor was set early in theproject and played a fundamental role in defining thepipeline structure and shaping all aspects of the circuitdesign as implementation proceeded. Early on, criticaltiming paths through the processor were simulated indetail in order to verify the feasibility of the designpoint and to help structure the pipeline for maximumperformance. Based on this early work, the goal for therest of the circuit design was to match the performance setduring these early studies, with custom design techniquesfor most of the dataflow macros and logic synthesis formost of the control logic—an approach similar to thatused previously [2 0 ]. Special circuit-analysis and modelingtechniques were used throughout the design in order toallow full exploitation of all of the benefits of the IBMadvanced SOI technology.
The sheer size of the chip, its complexity, and thenumber of transistors placed some important constraintson the design which could not be ignored in the push tomeet the aggressive cycle-time target on schedule. Theseconstraints led to the adoption of a primarily static-circuitdesign strategy, with dynamic circuits used only sparinglyin SRAMs and other critical regions of the processor core.Power dissipation was a significant concern, and it was akey factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology,
including uncertainties associated with the modelingof the floating-body effect [2 1 –2 3 ] and its impact onnoise immunity [2 2 , 2 4 –2 7 ] and overall chip decouplingcapacitance requirements [2 6 ], was another factor behindthe choice of a primarily static design style. Finally, thesize and logical complexity of the chip posed risks tomeeting the schedule; choosing a simple, robust circuitstyle helped to minimize overall risk to the projectschedule with most efficient use of CAD tool and designresources. The size and complexity of the chip alsorequired rigorous testability guidelines, requiring almostall cycle boundary latches to be LSSD-compatible formaximum dc and ac test coverage.
Another important circuit design constraint was thelimit placed on signal slew rates. A global slew rate limitequal to one third of the cycle time was set and enforcedfor all signals (local and global) across the whole chip.The goal was to ensure a robust design, minimizingthe effects of coupled noise on chip timing and alsominimizing the effects of wiring-process variability onoverall path delay. Nets with poor slew also were foundto be more sensitive to device process variations andmodeling uncertainties, even where long wires and RCdelays were not significant factors. The general philosophywas that chip cycle-time goals also had to include theslew-limit targets; it was understood from the beginningthat the real hardware would function at the desiredcycle time only if the slew-limit targets were also met.
The following sections describe how these designconstraints were met without sacrificing cycle time. Thelatch design is described first, including a description ofthe local clocking scheme and clock controls. Then thecircuit design styles are discussed, including a description
Figure 9
Global clock wav eforms showing 20 ps of measured skew.
1.5
1.0
0.5
0.0
0 500 1000 1500 2000 2500
20 ps skew
Vol
ts (
V)
Time (ps)
IBM J. RES. & DEV. VOL. 4 6 NO. 1 JANUARY 2 0 0 2 J. D. WARNOCK ET AL.
33
34
EE141
Power Distribution
EE141
Power Supply Impedance No voltage source is ideal: ||Z|| > 0 Two principal elements increase Z: Resistance of supply lines (IR drop) Inductance of supply lines (L⋅di/dt drop)
36
EE141
Technology (µm)0.10.20.30.40.50.6
10-3
10-2
10-1
100
Scaling and Supply Impedance Typical target for supply impedance is to get 5-10%
voltage variation of nominal supply (e.g., 100mV for 1V supply)
In traditional scaling Vdd drops while power stays constant.
This forced drastic drop in required supply impedance: Vdd ↓, Idd ↑ à |Zrequired| ↓↓
Today’s chips: |Zrequired| ≈ 1 mΩ!
Vdd = 1V, P=100W => Idd=100A For ΔVdd,max = 100mV,
Zdd,max = 100mV/100A = 1mΩ!
Impedance Requirements of High-Performance Processors
Req
uire
d Im
peda
nce
(Ω)
37
EE141
IR Drop Example Intel Pentium 4: ~103W at ~1.275V
Idd = 81Amps
For 10% IR drop, total distribution resistance must be less than 1.6mΩ
On-chip wire R ≈ 20mΩ/sq. (thick metal) Can’t meet R requirement even with multiple,
complete layers dedicated to power Main motivation for flip-chip packaging
38
EE141
Power Delivery
Achieving such low impedance requires a lot of resources: ~70% of package pins just for power Top 2-3 (thick) metal layers 39
Chip Metal Layers
EE141
Not Just Impedance - Electromigration
On-chip wires: current limited to ~1mA/µm for 5-7 year lifetime
40
EE141
On-Chip Power Distribution Power network usually follows pre-defined
template (often referred to as “power grid”)
VDD Gnd VDD Gnd
VDD
Gnd
VDD
Gnd 41
EE141
3 Metal Layer Approach (EV4)3rd “coarse and thick” metal layer added to the
technology for EV4 design Power supplied from two sides of the die via 3rd metal layer
2nd metal layer used to form power grid 90% of 3rd metal layer used for power/clock routing
Metal 3
Metal 2Metal 1
Courtesy Compaq 42
EE141
4 Metal Layers Approach (EV5)4th “coarse and thick” metal layer added to the
technology for EV5 design Power supplied from four sides of the die Grid strapping done all in coarse metal
90% of 3rd and 4th metals used for power/clock routing
Metal 3
Metal 2Metal 1
Metal 4
Courtesy Compaq 43
EE141
2 reference plane metal layers added to the technology for EV6 design Solid planes dedicated to Vdd/Vss
Lowers on-chip inductance
6 Metal Layer Approach – EV6
Metal 4
Metal 2Metal 1
RP2/Vdd
RP1/Vss
Metal 3
Courtesy Compaq 44
EE141
Pin Inductance Major source of inductance is
through the bonding pin connections to the chip package.
C4 bump inductance is 25pH wire-bond inductance of 1nH/mm
45
V = L ⋅dIdt
Example: Processor transient current is 50A in 20ps from 1V supply How many C4 bumps do we need to get supply noise
spike of less than 10%? With wirebonds, how many wirebonds are needed?
EE141
Pin Inductance Example Processor transient current is 50A in 20ps from 1V supply:
C4 bump inductance is 25pH How many C4 bumps do we need to get supply noise spike
of less than 10%?
With wirebond inductance of 1nH how many wirebonds are needed?
46
V = L ⋅dIdt
L = V ⋅dtdI
= 0.1V ⋅20ps50A
= 0.04pH
Inductors in parallel add
25pH0.04pH
= 625
1000pH0.04pH
> 25K!
EE141
SUPPLY
Boardwiring
Bondingwire
Decouplingcapacitor
CHIPCd
1
2
Decoupling Capacitors Help with Inductive and Resistive Effects
On the board (right under the supply pins) On the chip (under the supply straps, near large buffers) Cd helps avoid current rushing through supply wires
local store of charge “smoothing filter” on supply voltage
Decoupling capacitors are added:
47
Cd
EE141
Decoupling Capacitors
Under the die
48