UC Regents Fall 2010 © UCB | CS 250 L10: Design Verification

CS 250 VLSI System Design
Lecture 10 – Design Verification

2010-10-11: John Wawrzynek and Krste Asanovic, with John Lazzaro
TA: Yunsup Lee
www-inst.eecs.berkeley.edu/~cs250/
IBM POWER4: 174 million transistors. A complex design ...

96% of all bugs were caught before first tape-out.
First silicon booted AIX & Linux, on a 16-die system.
How ???

Background: J. D. Warnock et al., IBM J. Res. & Dev., Vol. 46, No. 1, January 2002, p. 28. Excerpts:

... multi-site team, necessitating the development of ways to synchronize the design environment and data (as well as the design team). In the following sections of this paper, the design methodology, clock network, circuits, power distribution, integration, and timing approaches used to meet these challenges for the POWER4 chip are described, and results achieved for POWER4 are presented.

Design methodology
The design methodology for the POWER4 microprocessor featured a hierarchical approach across multiple aspects of the design. The chip was organized physically and logically in a four-level hierarchy, as illustrated in Figure 2. The smallest members of the hierarchy are “macros”, typically containing 50 000 transistors. Units comprise approximately 50 related macros, with the microprocessor core made up of six units. The highest level is the chip, which contains two cores plus the units associated with the on-chip memory subsystem and interconnection fabric. This hierarchy facilitates concurrent design across all four levels. While the macros (blocks such as adders, SRAMs, and control logic) are being designed at the transistor and ...

Figure 1: POWER4 chip photograph showing the principal functional units in the microprocessor core and in the memory subsystem.

Figure 2: Elements in the physical and logical hierarchy used to design the POWER4 chip. (Diagram: the Chip contains Cores and the Memory subsystem; a Core contains Units such as FPU, FXU, and IFU; Units contain Macros 1..n. Annotation: “Macros, units, core, and chip all generate initial timing and floorplan contracts.”)

Table 1: Features of the IBM CMOS 8S3 SOI technology.

  Gate Leff              0.09 µm
  Gate oxide             2.3 nm
  Metal layers           pitch      thickness
    M1                   0.5 µm     0.31 µm
    M2                   0.63 µm    0.31 µm
    M3–M5                0.63 µm    0.42 µm
    M6 (MQ)              1.26 µm    0.92 µm
    M7 (LM)              1.26 µm    0.92 µm
  Dielectric εr          ~4.2
  Vdd                    1.6 V

Table 2: Characteristics of the POWER4 chip fabricated in CMOS 8S3 SOI.

  Clock frequency (fc)   >1.3 GHz
  Power                  115 W (@ 1.1 GHz, 1.5 V)
  Transistors            174,000,000
  Macros (unique/total)  1015 / 4341
    Custom               442 / 2002
    RLM                  523 / 2158
    SRAM                 50 / 181
  Total C4s              6380
  Signal I/Os            2200
  I/O bandwidth          >500 Mb/s
  Bus frequency          1/2 fc
  Engineered wires
  Buffers and inverters
  Decoupling cap         300 nF
Three main components ...

(1) Specify chip behavior at the RTL level, and comprehensively simulate it.

(2) Use formal verification to show equivalence between Verilog RTL and circuit schematic RTL.

(3) Technology layer: do the electrons implement the RTL, at speed and power?

Today, we focus on (1).
Lecture Focus: Functional Design Test

testing goal: The processor design correctly executes programs written in the Instruction Set Architecture. Not manufacturing tests ...

“Correct” == meets the “Architect’s Contract”

Slide backdrop: Intel XScale ARM Pipeline, IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, November 2001, p. 1600. Excerpt:

Fig. 1. Process SEM cross section.

The process Vt was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the Vt-versus-poly-length dependence, and source-to-body bias is used to electrically limit transistor leakage in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb-mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2 (“Microprocessor pipeline organization”), where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- ...
Architect’s “Contract with the Programmer”
To the program, it appears that instructions execute in the correct order defined by the ISA.
What the machine actually does is up to the hardware designers, as long as the contract is kept.
As each instruction completes, the architected machine state appears to the program to obey the ISA.
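The contract can be phrased as an executable check: after each instruction retires, the architected state the implementation exposes must equal what a plain in-order ISA interpreter computes. A minimal Python sketch follows; the toy two-opcode ISA, its encoding, and the eight-register state are invented for illustration, not taken from any real machine.

```python
# Toy illustration of the "architect's contract": however the hardware
# reorders or pipelines work internally, the architected state visible
# after each retired instruction must match a simple in-order ISA model.

def isa_step(state, instr):
    """Golden in-order semantics for a toy ISA (hypothetical encoding)."""
    op, rd, rs, imm = instr
    regs = dict(state)                # architected state: 8 registers
    if op == "ADDI":
        regs[rd] = (regs[rs] + imm) & 0xFFFFFFFF
    elif op == "SUB":
        regs[rd] = (regs[rs] - imm) & 0xFFFFFFFF
    else:
        raise ValueError("unknown opcode " + op)
    return regs

def check_contract(program, retire_log):
    """retire_log[i] is the architected state the implementation exposed
    after instruction i retired; compare it against the golden model."""
    state = {r: 0 for r in range(8)}
    for instr, observed in zip(program, retire_log):
        state = isa_step(state, instr)
        if observed != state:
            return False              # contract violated at this instruction
    return True
```

A pipelined, out-of-order, or speculating implementation is free to do anything internally, as long as the per-instruction retire log it produces passes this check.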
Three models (at least) to cross-check.
The “contract” specification: “The answer” (correct, we hope). Simulates the ISA model in C. Fast. Better: two models coded independently.

The Verilog RTL model: Logical semantics of the Verilog model we will use to create gates. Runs on a software simulator or FPGA hardware.

Chip-level schematic RTL: Catch synthesis bugs. Formally verify the netlist against the Verilog RTL. Also used for timing and power.
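The advice that two independently coded contract models beat one can be illustrated with a small Python sketch: two authors implement the same specification (here, a hypothetical 32-bit population-count spec) in deliberately different styles, and random cross-checking flags any input where they disagree. Both implementations below are standard textbook algorithms, not code from the lecture.

```python
import random

# Two independently coded "golden models" of the same spec (popcount of
# a 32-bit word). Writing the spec twice, in different styles, makes it
# less likely that both authors embed the same misunderstanding.

def popcount_loop(x):
    """Author A: naive bit-at-a-time count."""
    n = 0
    while x:
        n += x & 1
        x >>= 1
    return n

def popcount_tricks(x):
    """Author B: classic SWAR reduction, written independently."""
    x = x - ((x >> 1) & 0x55555555)
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F
    return (x * 0x01010101 & 0xFFFFFFFF) >> 24

def cross_check(trials=10000, seed=0):
    """Random cross-check; returns a disagreeing input, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        v = rng.getrandbits(32)
        if popcount_loop(v) != popcount_tricks(v):
            return v      # at least one model is wrong for this input
    return None
```

A disagreement proves a bug in one model without saying which; agreement over many random inputs builds confidence that the shared specification was understood the same way twice.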
Where do bugs come from?
Where bugs come from (a partial list) ...
The contract is wrong. You understand the contract, create a design that correctly implements it, write correct Verilog for the design ...

The contract is misread. Your design is a correct implementation of what you think the contract means ... but you misunderstand the contract.

Conceptual error in design. You understand the contract, but devise an incorrect implementation of it ...

Verilog coding errors. You express your correct design idea in Verilog ... with incorrect Verilog semantics. Examples: name misspellings, latch implication, combinational loops.
Four Types of Testing
Big Bang: Complete Processor Testing

(Diagram: a taxonomy of testing spanning bottom-up to top-down approaches; this slide adds “complete processor testing”.)

How it works: Assemble the complete processor. Execute the test program suite on the processor. Check results.

Checks the contract model against Verilog RTL. The test suite runs the gamut from “1-line programs” to “boot the OS”.
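A big-bang harness can be sketched in a few lines of Python. Here `run_golden` and `run_rtl_sim` are placeholder names standing in for the real ISA reference simulator and the full-processor RTL simulation; only the diffing loop is the point.

```python
# Sketch of a "big bang" harness: run every program in the suite on the
# golden contract model and on the RTL simulation, then diff results.

def run_golden(program):
    """Placeholder for the ISA-level reference simulator."""
    raise NotImplementedError

def run_rtl_sim(program):
    """Placeholder for Verilog RTL simulation of the full processor."""
    raise NotImplementedError

def big_bang(test_suite, golden=run_golden, dut=run_rtl_sim):
    """Run every (name, program) pair on both models and diff results.

    Returns the list of failing tests; an empty list means the suite
    found no disagreement (a verification verdict, not a proof)."""
    failures = []
    for name, program in test_suite:
        expected = golden(program)
        actual = dut(program)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures
```

Note that the harness only answers the verification question, "is there a bug?"; a non-empty failure list says nothing yet about where in the pipeline the bug lives.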
Methodical Approach: Unit Testing

(Diagram: bottom-up testing; “unit testing” joins “complete processor testing” in the taxonomy.)

How it works: Remove a block from the design. Test it in isolation against its specification.

Requires writing a bug-free “contract model” for the unit.
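Unit testing in this style can be sketched in Python: pull one block out and drive it with random stimulus against its contract model. The 8-bit saturating adder below is a hypothetical unit invented for illustration, not a block from the lecture.

```python
import random

# Unit-testing sketch: a block (a hypothetical 8-bit saturating adder)
# is tested in isolation against a bug-free contract model.

WIDTH = 8
MAX = (1 << WIDTH) - 1

def contract_sat_add(a, b):
    """Contract model: the ideal-arithmetic spec of the unit."""
    return min(a + b, MAX)

def unit_sat_add(a, b):
    """The unit under test (imagine this mirrors the RTL's structure:
    a truncating adder plus a carry-out that forces saturation)."""
    total = a + b
    carry = total >> WIDTH
    return MAX if carry else total & MAX

def unit_test(trials=10000, seed=1):
    """Drive the isolated unit with seeded random stimulus."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randrange(MAX + 1), rng.randrange(MAX + 1)
        assert unit_sat_add(a, b) == contract_sat_add(a, b), (a, b)
```

Because the unit is small, random stimulus covers its input space densely, and any failure immediately names the block at fault: exactly the diagnosability that big-bang testing lacks.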
Climbing the Hierarchy: Multi-unit Testing

(Diagram: bottom-up testing; “multi-unit testing” added above “unit testing” in the taxonomy.)

How it works: Remove connected blocks from the design. Test them in isolation against specification.

Choice of partition determines if the test moves the project forward.
Processor Testing with Self-Checking Units

[Diagram: the testing hierarchy. Bottom-up testing: unit testing, then multi-unit testing. Top-down testing: complete processor testing, then processor testing with self-checks.]

How it works: add self-checking to units, then perform complete processor testing.

Self-checks are unit tests built into the CPU that generate the "right answer" on the fly. Slower to simulate.
UC Regents Fall 2010 © UCBCS 250 L10: Design Verification
Testing: Verification vs. Diagnostics

Diagnosis of bugs found during "complete processor" testing is hard ...

Verification: a yes/no answer to the question "Does the processor have one more bug?"
Diagnostics: clues to help find and fix the bug.
"CPU program" diagnosis is tricky ...

Observation: on a buggy CPU model, the correctness of every executed instruction is suspect.

Consequence: one needs to verify the correctness of the instructions that surround the suspected buggy instruction. How many depends on (1) the number of "instructions in flight" in the machine, and (2) the lifetime of non-architected state (which may be "indefinite").
State observability and controllability

Observability: does my model expose the state I need to diagnose the bug?
Controllability: does my model support changing the state value I need to change to diagnose the bug?
Support != "yes, just rewrite the model code"!
Writing a Test Plan
The testing timeline ...

[Timeline, Epoch 1 through Epoch 4. Milestones: processor assembly complete; correctly executes single instructions; correctly executes short programs.]

Plan in advance what tests to do when ...
An example test plan ...

[Timeline, Epoch 1 through Epoch 4: unit testing early, multi-unit testing later, then processor testing with self-checks and complete processor testing. Unit, multi-unit, and self-checking tests provide diagnostics; complete processor testing provides verification.]
Unit Testing
Combinational Unit Testing: 3-bit Adder

[Figure: 3-bit adder. Inputs: A (3 bits), B (3 bits), Cin. Outputs: Sum (3 bits), Cout.]

Number of input bits? 7
Total number of possible input values? 2^7 = 128
Just test them all ... apply "test vectors" 0, 1, 2, ... 127 to the inputs.
100% input space "coverage": "exhaustive testing".
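The exhaustive strategy can be sketched in software (a Python model standing in for a Verilog testbench; both adder functions below are hypothetical stand-ins for a golden model and the design under test):

```python
# Exhaustive test of a 3-bit adder: 2^7 = 128 vectors cover the entire
# input space. Both models are illustrative stand-ins; in practice the
# DUT would be the simulated RTL.

def adder3_golden(a, b, cin):
    """Reference model: returns (sum, cout) for a 3-bit add."""
    total = a + b + cin
    return total & 0b111, total >> 3

def adder3_dut(a, b, cin):
    """Stand-in for the design under test."""
    total = a + b + cin
    return total % 8, total // 8

failures = 0
for vector in range(128):          # test vectors 0, 1, 2, ... 127
    a = vector & 0b111             # bits [2:0] drive A
    b = (vector >> 3) & 0b111      # bits [5:3] drive B
    cin = (vector >> 6) & 1        # bit  [6]   drives Cin
    if adder3_dut(a, b, cin) != adder3_golden(a, b, cin):
        failures += 1

print(failures)   # 0: 100% input-space coverage, no mismatches
```
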
Combinational Unit Testing: 32-bit Adder

[Figure: 32-bit adder. Inputs: A (32 bits), B (32 bits), Cin. Outputs: Sum (32 bits), Cout.]

Number of input bits? 65
Total number of possible input values? 2^65 = 3.689e+19
Just test them all? Exhaustive testing does not "scale": "combinatorial explosion!"
Test Approach 1: Random Vectors

How it works: apply random A, B, Cin to the adder; check Sum, Cout. In Verilog, use $random to set the inputs in the testbench.

Bug curve: plot bugs found per minute of testing (bug rate vs. time). When to stop testing?
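A software sketch of the random-vector approach, sampling the 2^65 input space instead of enumerating it (the masked-arithmetic "DUT" below is a hypothetical stand-in; the self-check compares against Python's unbounded integers):

```python
# Random-vector testing of a 32-bit adder model.
import random

MASK = 0xFFFFFFFF

def adder32_dut(a, b, cin):
    """Stand-in DUT: returns (sum, cout)."""
    total = a + b + cin
    return total & MASK, total >> 32

random.seed(2010)                   # seeded for a reproducible run
mismatches = 0
for _ in range(10_000):             # a tiny sample of the 2^65 space
    a = random.getrandbits(32)      # like $random driving the testbench
    b = random.getrandbits(32)
    cin = random.getrandbits(1)
    s, cout = adder32_dut(a, b, cin)
    if (cout << 32) | s != a + b + cin:   # self-check
        mismatches += 1

print(mismatches)   # 0
```
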
Test Approach 2: Directed Vectors

How it works: hand-craft test vectors to cover "corner cases", e.g. A == B == Cin == 0.

"Black-box": corner cases based on functional properties.
"Clear-box": corner cases based on the unit's internal structure.
Power tool: directed random.
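A black-box directed list might look like the sketch below (the particular corner cases are illustrative choices based on functional properties of addition; a clear-box list would instead target the implementation's carry-chain structure):

```python
# Directed ("corner case") vectors for a 32-bit adder model.
MASK = 0xFFFFFFFF

def adder32(a, b, cin):
    total = a + b + cin
    return total & MASK, total >> 32

corners = [
    (0, 0, 0),                     # A == B == Cin == 0
    (MASK, MASK, 1),               # both operands maximal, carry in
    (MASK, 0, 1),                  # carry ripples the full width
    (0x80000000, 0x80000000, 0),   # MSB-only operands: Sum 0, Cout 1
    (0x55555555, 0xAAAAAAAA, 0),   # alternating bits: Sum all ones
]

results = [adder32(a, b, cin) for a, b, cin in corners]
print(results[0], results[2])   # (0, 0) (0, 1)
```

Directed random then keeps the hand-picked structure of these vectors but randomizes the remaining fields.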
State Machine Testing

CPU design examples: DRAM controller state machines, cache control state machines, branch prediction state machines.
Testing State Machines: Break Feedback

[Figure: traffic-light controller. "Next state" combinational logic feeds three flip-flops (R, Y, G), with Rst and Change as inputs; the flip-flop outputs feed back to the logic.]

Isolate the "next state" logic. Test it as a combinational unit.
Easier with certain Verilog coding styles ...
Testing State Machines: Arc Coverage

[State diagram: one-hot states R (100), G (001), Y (010). Change == 1 arcs advance the machine around the cycle; Rst == 1 forces state R.]

Force the machine into each state. Test the behavior of each arc.
Intractable for state machines with high edge density ...
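Arc coverage for the traffic-light machine can be sketched as below. Note the transition order R to G to Y on Change == 1 is an assumption made for this sketch, as is Rst having priority over Change:

```python
# Arc-coverage sketch for a traffic-light FSM with one-hot states R, G, Y.

def next_state(state, rst, change):
    if rst:
        return "R"                 # Rst == 1 forces state R (assumed priority)
    if not change:
        return state               # hold when Change == 0
    return {"R": "G", "G": "Y", "Y": "R"}[state]   # assumed cycle order

# Force the machine into each state, then exercise every arc.
arcs = [
    ("R", 0, 1, "G"),
    ("G", 0, 1, "Y"),
    ("Y", 0, 1, "R"),
    ("Y", 1, 1, "R"),   # Rst wins over Change
    ("G", 0, 0, "G"),   # self-loop
]
covered = all(next_state(s, r, c) == exp for s, r, c, exp in arcs)
print(covered)   # True
```

For a real design, the arc list comes from the state diagram; the point is that each (state, input) pair is visited explicitly.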
Regression Testing
Or, how to find the last bug ...
Writing "complete CPU" test programs

Single instructions with directed-random field values.
White-box, "instructions-in-flight"-sized programs that stress the design.
Tests that stress long-lived non-architected state.
Regression testing: re-run subsets of the test library, and then the entire library, after a fix.
CS 250 VLSI System Design
Lecture 10 – Pipeline Micro-architecture
2010-10-11  John Wawrzynek and Krste Asanovic, with John Lazzaro
TA: Yunsup Lee
www-inst.eecs.berkeley.edu/~cs250/
Pipelining Basics
Starting Point: Performance Equation

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

CPI: the average number of clock Cycles Per Instruction for the program.

Rationale: every additional instruction you execute takes time.
Rationale: by shortening the period of each cycle, we shorten execution time.
Different programs have different CPIs, for a variety of reasons.
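The equation is easy to exercise numerically (all numbers below are illustrative, not from any real machine):

```python
# The performance equation as code.

def exec_time(instructions, cpi, cycle_time):
    """Seconds/Program = (Instr/Program) * (Cycles/Instr) * (Seconds/Cycle)."""
    return instructions * cpi * cycle_time

base = exec_time(1_000_000, 1.0, 1e-9)          # CPI == 1 at a 1 ns clock
fewer = exec_time(800_000, 1.0, 1e-9)           # fewer instructions: faster
faster_clk = exec_time(1_000_000, 1.0, 0.5e-9)  # shorter cycle: faster
print(base, fewer, faster_clk)
```
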
Consider a machine with a data cache ...

A program's load instructions "stride" through every memory address. The cache never "hits", so every load goes to DRAM (100x slower than loads that go to cache).

Thus, the average number of cycles for load instructions is higher for this program. Thus, the average number of cycles for all instructions (Cycles/Instruction) is higher for this program. Thus, the program takes longer to run (Seconds/Program)!
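A rough sketch of the effect on average CPI, using the slide's 100x DRAM penalty and an assumed (made-up) instruction mix of 30% loads:

```python
# How a striding, never-hitting load pattern inflates average CPI.

def avg_cpi(load_frac, load_hit_rate, base_cpi=1.0, miss_penalty=100):
    """Extra cycles come only from loads that miss (illustrative model)."""
    return base_cpi + load_frac * (1 - load_hit_rate) * miss_penalty

friendly = avg_cpi(0.30, 1.00)   # every load hits the cache
striding = avg_cpi(0.30, 0.00)   # every load goes to DRAM
print(friendly, striding)        # 1.0 31.0
```
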
Starting Point: Single-cycle processor

[Datapath figure: PC, instruction memory, register file (rs1/rs2/ws, rd1/rd2, WE), sign extender, 32-bit ALU, and data memory (Addr/Din/Dout, WE, MemToReg).]

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

CPI == 1. This is good.
Slow. This is bad.
Challenge: speed up the clock while keeping CPI == 1.
Observation: Logic idle most of cycle

[Same single-cycle datapath figure.]

For most of the cycle, the ALU is either "waiting" for its inputs or "holding" its output.
Ideal: a CPU architecture where each part is always "working".
Inspiration: Automobile assembly line

The assembly line moves on a steady clock. Each station does the same task on each car.

[Figure: car body shell and car chassis meet at a merge station, then a bolting station; "the clock" paces the line.]
Inspiration: Automobile assembly line

Simpler station tasks mean more cars per hour: simple tasks take less time, so the clock is faster.
Inspiration: Automobile assembly line

Line speed is limited by the slowest task. The line is most efficient if all tasks take the same time to do.
Inspiration: Automobile assembly line

Simpler tasks plus a complex car means a long line! These lines go 24 x 7, and rarely shut down.
Key analogy: The instruction is the car

[Figure: instruction fetch (Pipeline Stage #1) feeds a chain of IR registers; the IR in each later stage controls the hardware in that stage, Stage #2 through Stage #5.]

"Data-stationary control"
Example: Decode & Register Fetch stage

[Figure: Stage #1 (instruction fetch) feeds Stage #2 (decode & register fetch: the register file reads rd1/rd2 into the A and B registers, the extended immediate into M), which feeds Stage #3.]

A sample program:
  ADD R4,R3,R2
  OR  R7,R6,R5
  SUB R10,R9,R8

Registers chosen so that the instructions are independent, like cars on the line.
Performance Equation and Pipelining

[Figure: three-stage pipeline: instruction fetch, decode & register fetch, Stage #3.]

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

To get the shortest clock period, balance the work to do in each pipeline stage.
CPI == 1: once the pipe is full, one instruction completes per cycle.
Less work to do in each cycle, so the clock period is shorter.
Hazards: An instruction is not a car ...

[Figure: same three-stage pipeline.]

New sample program:
  ADD R4,R3,R2
  OR  R5,R4,R2

R4 not written yet ... the wrong value of R4 is fetched from the RegFile, and the contract with the programmer is broken! Oops!

This is an example of a "hazard": we must (1) detect and (2) resolve all hazards to make a CPU that matches the ISA.
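Hazard detection for this case can be sketched as a simple dependence check. The (opcode, dest, src1, src2) tuple encoding is a hypothetical convenience, not the machine's real instruction format:

```python
# Detect the RAW hazard in the sample program: the OR reads R4 while the
# ADD ahead of it has not yet written R4 back.

def raw_hazard(older, younger):
    """True if the younger instruction reads the older one's destination."""
    _, dest, _, _ = older
    _, _, s1, s2 = younger
    return dest != 0 and dest in (s1, s2)   # R0 never hazards (MIPS-style)

add = ("ADD", 4, 3, 2)    # ADD R4,R3,R2
or_ = ("OR", 5, 4, 2)     # OR  R5,R4,R2 reads R4
print(raw_hazard(add, or_))   # True
```
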
Performance Equation and Hazards

[Figure: same three-stage pipeline.]

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

Some ways to cope with hazards: "stalling the pipeline" makes CPI > 1, and the added logic to detect and resolve hazards increases the clock period.
A (simplified) 5-stage pipelined CPU

[Figure: five pipeline stages with IR registers between them. (1) "IF" (instruction fetch), (2) "ID/RF" (decode & register fetch), (3) "EX" (execution: 32-bit ALU), (4) "MEM" (data memory: Addr/Din/Dout, WE, MemToReg), (5) "WB" (write back through mux and logic). Control bits such as WE and MemToReg travel down the pipe with the instruction.]
Visualizing Pipelines
Pipeline Representation #1: Timeline

Sample program:
  I1: ADD R4,R3,R2
  I2: OR  R7,R6,R5
  I3: SUB R1,R9,R8
  I4: XOR R3,R2,R1
  I5: AND R6,R5,R4

Inst  t1  t2  t3  t4   t5   t6   t7   t8
I1:   IF  ID  EX  MEM  WB
I2:       IF  ID  EX   MEM  WB
I3:           IF  ID   EX   MEM  WB
I4:               IF   ID   EX   MEM  WB
I5:                    IF   ID   EX   MEM
I6:                         IF   ID   EX

The pipeline is "full" from t5 on. Good for visualizing pipeline fills.
Representation #2: Resource Usage

Same sample program (I1 through I5 as above, plus the instructions that follow).

Stage  t1  t2  t3  t4  t5  t6  t7  t8
IF:    I1  I2  I3  I4  I5  I6  I7  I8
ID:        I1  I2  I3  I4  I5  I6  I7
EX:            I1  I2  I3  I4  I5  I6
MEM:               I1  I2  I3  I4  I5
WB:                    I1  I2  I3  I4

The pipeline is "full" from t5 on. Good for visualizing pipeline stalls.
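Both views of an ideal, hazard-free 5-stage pipe follow from one small formula, sketched here:

```python
# Which stage instruction i occupies at cycle t in an ideal 5-stage pipe.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(i, t):
    """Stage of instruction i (0-based) at cycle t (1-based), or '--'."""
    k = t - 1 - i
    return STAGES[k] if 0 <= k < len(STAGES) else "--"

def user_of(stage, t):
    """Instruction occupying a stage at cycle t (resource-usage view)."""
    i = t - 1 - STAGES.index(stage)
    return "I%d" % (i + 1) if i >= 0 else "--"

# Timeline view, rows = instructions:
for i in range(3):
    print("I%d:" % (i + 1), " ".join(stage_of(i, t) for t in range(1, 9)))
```
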
Data and Control Hazards
Data Hazards: 3 Types (RAW, WAR, WAW)

Several pipeline stages read or write the same data location in an incompatible way.

Read After Write (RAW) hazards: instruction I2 expects to read a data value written by an earlier instruction, but I2 executes "too early" and reads the wrong copy of the data.

Note "data value", not "register". Data hazards are possible for any architected state (such as main memory). In practice, main-memory hazard avoidance is the job of the memory system.
Recall: RAW example

[Figure: same three-stage pipeline.]

Sample program:
  ADD R4,R3,R2
  OR  R5,R4,R2

R4 not written yet ... the wrong value of R4 is fetched from the RegFile, and the contract with the programmer is broken! Oops!

This is what we mean when we say Read After Write (RAW) hazard.
Control Hazards: A taken branch/jump

Sample program (ISA without a branch delay slot):
  I1: BEQ R4,R3,25
  I2: SUB R1,R9,R8
  I3: AND R6,R5,R4

Inst  t1  t2  t3  t4   t5
I1:   IF  ID  EX  MEM  WB    (the EX stage computes whether the branch is taken)
I2:       IF  ID  ...
I3:           IF  ...

If the branch is taken, I2 and I3 MUST NOT complete!
Note: with a branch delay slot, I2 MUST complete and I3 MUST NOT complete.
Hazard Resolution Tools
The Hazard Resolution Toolkit

Stall earlier instructions in the pipeline.
Kill earlier instructions in the pipeline.
Forward results computed in later pipeline stages to earlier stages.
Add new hardware, or rearrange the hardware design, to eliminate the hazard.
Make hardware handle concurrent requests to eliminate the hazard.
Change the ISA to eliminate the hazard.
Resolving a RAW hazard by stalling

[Figure: three-stage pipeline with new datapath hardware: (1) a mux into IR 2/3 to feed in a NOP, (2) write enables on the PC and IR 1/2.]

Sample program:
  ADD R4,R3,R2
  OR  R5,R4,R2

Let the ADD proceed to the WB stage, so that R4 is written to the regfile. Keep re-executing the OR instruction until R4 is ready; until then, send NOPs to IR 2/3, and freeze the PC and IR 1/2 until the stall is over.
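The cost of stall-only resolution can be sketched with a bubble count. This assumes the classic 5-stage pipe (the slide draws only the first three stages) and a regfile that internally forwards, writing in the first half of the WB cycle and reading in the second half of the ID cycle:

```python
# Bubbles needed for stall-only RAW resolution, under the assumptions above.

def raw_stalls(distance, write_stage=5, read_stage=2):
    """Bubbles when the consumer trails the producer by `distance` instrs."""
    return max(0, write_stage - read_stage - distance)

# ADD R4,... immediately followed by OR ...,R4,... has distance 1:
print(raw_stalls(1), raw_stalls(2), raw_stalls(3))   # 2 1 0
```
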
Resolving a RAW hazard by forwarding

[Figure: IF, ID/RF, and EX stages; the ALU output in EX feeds back to the operand muxes.]

Sample program:
  ADD R4,R3,R2
  OR  R5,R4,R2

The ALU computes R4 in the EX stage, so ... just forward it back!
Unlike stalling, forwarding does not change CPI. It may hurt cycle time.
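The forwarding decision is just a mux-select, sketched here for a single bypass path (the tuple for the instruction ahead is a hypothetical encoding):

```python
# EX-operand selection with one forwarding path: prefer the value the
# previous instruction just computed over the stale regfile read.

def ex_operand(src_reg, regfile_value, fwd):
    """fwd: (dest_reg, alu_result) of the instruction one ahead, or None."""
    if fwd is not None and fwd[0] == src_reg and src_reg != 0:
        return fwd[1]          # forwarded: no stall, CPI unchanged
    return regfile_value       # no hazard: use the value read in decode

# ADD R4,R3,R2 computed R4 = 5 in EX; OR R5,R4,R2 now needs R4:
print(ex_operand(4, 999, (4, 5)))   # 5, not the stale 999
```
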
Control Hazards: Fix with more hardware

Sample program (ISA without a branch delay slot):
  I1: BEQ R4,R3,25
  I2: SUB R1,R9,R8
  I3: AND R6,R5,R4

The EX stage computes whether the branch is taken; if the branch is taken, I2 and I3 MUST NOT complete. If we add hardware, can we move the branch decision into the ID stage?
Resolving control hazard with hardware

[Figure: three-stage pipeline with an equality comparator (==) on the two regfile read ports in the decode stage; its output goes to the branch control logic.]
Control Hazards: After more hardware

Sample program (ISA without a branch delay slot):
  I1: BEQ R4,R3,25
  I2: SUB R1,R9,R8

The ID stage now computes whether the branch is taken; if the branch is taken, only I2 MUST NOT complete. If we change the ISA, we can always let I2 complete (a "branch delay slot") and eliminate the control hazard.
Resolve control hazard by killing instructions

[Figure: three-stage pipeline; logic detects a J instruction in decode and muxes a NOP into IR 1/2; the new PC is computed by hardware not shown.]

Sample program (no delay slot):
  J 200
  OR R5,R4,R2

Detect the J instruction, mux a NOP into IR 1/2, and compute the new PC using hardware not shown ...

This hurts CPI. One can do better.
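The kill itself is a mux on the IR 1/2 pipeline register, sketched below. The NOP encoding shown is MIPS's canonical `sll r0,r0,0`; the tuple form is a hypothetical convenience:

```python
# Killing the wrong-path instruction after a taken J: mux a NOP into
# IR 1/2 so the already-fetched instruction never completes.

NOP = ("SLL", 0, 0, 0)   # MIPS canonical NOP: sll r0,r0,0

def ir_1_2_next(fetched_instr, jump_taken_in_decode):
    return NOP if jump_taken_in_decode else fetched_instr

wrong_path = ("OR", 5, 4, 2)            # fetched right after J 200
print(ir_1_2_next(wrong_path, True))    # killed: a NOP enters the pipe
print(ir_1_2_next(wrong_path, False))   # normal flow
```
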
Hazard Diagnosis
Assume MIPS ISA in examples to follow ...
Data Hazards: Read After Write

Read After Write (RAW) hazards: instruction I2 expects to read a data value written by an earlier instruction, but I2 executes "too early" and reads the wrong copy of the data.

Classic solution: use forwarding heavily; fall back on stalling when forwarding won't work or slows down the critical path too much.
Full bypass network ...

[Figure: ID (decode), EX, MEM, and WB stages; forwarding paths run from the EX/MEM and MEM/WB pipeline registers, and from WB, back to the ALU operand muxes.]
Common bug: Multiple forwards ...

[Figure: same full-bypass datapath.]

  ADD R4,R3,R2   OR R2,R3,R1   AND R2,R2,R1

Which do we forward from?
Common bug: Multiple forwards II ...

[Figure: same full-bypass datapath.]

  ADD R4,R0,R2   OR R0,R3,R1   AND R0,R2,R1

Which do we forward from?
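Both bugs come down to forwarding priority, sketched below: the youngest producer must win, and writes to R0 must never be forwarded because R0 is architecturally zero (MIPS assumed, per the examples):

```python
# Priority among multiple forwarding sources.

def forward(src_reg, regfile_value, ex_mem, mem_wb):
    """ex_mem / mem_wb: (dest_reg, value) for those stages, or None."""
    if src_reg == 0:
        return 0                     # R0 always reads as zero
    if ex_mem and ex_mem[0] == src_reg:
        return ex_mem[1]             # most recent write has priority
    if mem_wb and mem_wb[0] == src_reg:
        return mem_wb[1]
    return regfile_value

# Two in-flight writers of R2: the younger one (in EX/MEM) must win.
print(forward(2, 111, (2, 222), (2, 333)))   # 222
# Writers of R0 are ignored entirely.
print(forward(0, 111, (0, 222), (0, 333)))   # 0
```
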
LW and Hazards
No load “delay slot”
Questions about LW and forwarding

[Figure: same full-bypass datapath.]

  ADDIU R1,R1,24   LW R1,128(R29)   OR R3,R3,R2

Do we need to stall?
Questions about LW and forwarding

[Figure: same full-bypass datapath.]

  ADDIU R1,R1,24   LW R1,128(R29)   OR R1,R3,R1

Do we need to stall?
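The load-use interlock behind both questions can be sketched as below: LW's data is only available at the end of MEM, so forwarding cannot serve a dependent instruction that directly follows the load, and one bubble is required (this ISA has no load "delay slot"):

```python
# Load-use interlock sketch (hypothetical (opcode, dest, *srcs) tuples).

def load_use_stall(instr_in_ex, instr_in_id):
    """True if the instruction in decode must stall one cycle."""
    if instr_in_ex[0] != "LW":
        return False
    dest = instr_in_ex[1]
    return dest != 0 and dest in instr_in_id[2:]

lw = ("LW", 1, 29)                           # LW R1, 128(R29)
print(load_use_stall(lw, ("OR", 3, 3, 2)))   # False: the OR never reads R1
print(load_use_stall(lw, ("OR", 1, 3, 1)))   # True: OR R1,R3,R1 reads R1
```
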
Branches and Hazards
Single “delay slot”
Recall: Control hazard and hardware

[Figure: three-stage pipeline with an equality comparator (==) in the decode stage feeding the branch control logic.]
Recall: After more hardware, change ISA

Sample program (ISA without a branch delay slot):
  I1: BEQ R4,R3,25
  I2: SUB R1,R9,R8

The ID stage computes whether the branch is taken; if the branch is taken, I2 MUST NOT complete. If we change the ISA, we can always let I2 complete (a "branch delay slot") and eliminate the control hazard.
Question about branch and forwards:

[Figure: full-bypass datapath, with the == branch comparator in the decode stage feeding the branch control logic.]

  OR R3,R3,R1   BEQ R1,R3,label

Will this work as shown?
Lessons learned

Pipelining is hard.
Write test code in advance.
Study every instruction.
Think about interactions ...
Control Implementation
Recall: What is single cycle control?

[Figure: single-cycle datapath. The control block is combinational logic (only gates, no flip-flops) taking the instruction and Equal as inputs and producing RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, and PCSrc.]

Just specify the logic functions!
In pipelines, all IR registers are used

[Figure: the IR registers in ID (decode), EX, MEM, and WB each feed the combinational control logic (only gates, no flip-flops) for their stage; add extra state outside!]

A "conceptual" design: for the shortest critical path, the IR registers may hold decoded info, not the complete 32-bit instruction.
Advanced Pipelining
5 Stage Pipeline: A point of departure

[Embedded slide from CS 152 L10 "Pipeline Intro", Fall 2004 © UC Regents: graphically representing the MIPS pipeline (IM, Reg, ALU, DM, Reg) helps answer questions like: how many cycles does this code take to execute? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed?]

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. "At best" assumes: all delay slots (branch, load) filled; perfect caching; and no "multi-cycle" instructions (e.g., multiply with an accumulate register).
Superpipelining: Add more stages

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

Goal: reduce the critical path by adding more pipeline stages.
Difficulties: added penalties for load delays and branch misses.
Ultimate limiter: as logic delay goes to 0, FF clk-to-Q and setup time remain.
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance a pipeline
Example: 8-stage ARM XScale: extra IF, ID, and data cache stages. Also, power!
Superscalar: Multiple issues per cycle

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

Goal: improve CPI by issuing several instructions per cycle.
Difficulties: load and branch delays affect more instructions.
Ultimate limiter: programs may be a poor match to the issue rules.

Example: a CPU with floating-point ALUs that issues 1 FP + 1 integer instruction per cycle.
Throughput and multiple threads

Goal: use multiple CPUs (real and virtual) to improve (1) the throughput of machines that run many programs and (2) the execution time of multi-threaded programs.
Difficulties: gaining the full advantage requires rewriting applications, the OS, and libraries.
Ultimate limiter: Amdahl's law and memory system performance.
Example: Sun Niagara (8 SPARCs on one chip).
Superpipelining
CS
152 L10 Pipeline Intro (9)Fall 2004 ©
UC
Regents
Graphically R
epresenting MIP
S Pipeline
Can help w
ith answering questions like:
how m
any cycles does it take to execute this code?w
hat is the ALU
doing during cycle 4?is there a hazard, w
hy does it occur, and how can it be fixed?
ALU
IMR
egD
MR
eg
IR
ID+RF
EX
MEM
WB
IR
IR
IR
IF
5 Stage1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb-mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions.

Fig. 2. Microprocessor pipeline organization.

A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by
gray. Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
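The decoupling just described can be sketched as a tiny FIFO between fetch and decode. The class name, queue depth, and interface below are illustrative assumptions, not the paper's implementation: the point is only that fetch keeps running while a downstream stall drains into the queue.

```python
from collections import deque

class FetchQueue:
    """Two-entry queue decoupling instruction fetch from decode.

    When decode stalls, fetch can keep filling the queue until it is
    full, so the stall is deferred in the earlier pipe stages.
    """
    def __init__(self, depth=2):
        self.q = deque(maxlen=depth)

    def can_fetch(self):
        return len(self.q) < self.q.maxlen

    def push(self, instr):
        assert self.can_fetch()
        self.q.append(instr)

    def pop(self):  # called by decode when it is not stalled
        return self.q.popleft() if self.q else None

fq = FetchQueue()
fq.push("I1")
fq.push("I2")              # decode stalled: both fetches still proceeded
assert not fq.can_fetch()  # only now must fetch itself wait
assert fq.pop() == "I1"    # decode resumes
assert fq.can_fetch()      # and fetch may continue immediately
```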
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
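A toy model of this policy (the function and argument names are invented for illustration): hazards are flagged at RF, but the pipe only stalls at X1 if the needed result is still absent from the forwarding busses by then.

```python
# Toy model of deferred dependency stalls: hazards are detected at RF,
# but the pipe only stalls at X1 if the forwarded result is still absent.
def needs_stall_at_x1(srcs, in_flight, forwardable):
    """srcs: source registers of the instruction now in X1.
    in_flight: dests of older, uncompleted instructions (seen at RF).
    forwardable: dests whose results are on the bypass busses by X1."""
    pending = [r for r in srcs if r in in_flight]      # flagged at RF
    return any(r not in forwardable for r in pending)  # resolved at X1

# Producer's result arrives on the forwarding bus in time: no stall.
assert needs_stall_at_x1([1, 4], in_flight={1}, forwardable={1}) is False
# Producer's result not yet available (e.g., a load still in memory): stall.
assert needs_stall_at_x1([1, 4], in_flight={1}, forwardable=set()) is True
```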
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
8-stage pipeline:
IF now takes 2 stages (pipelined I-cache).
ID and RF each get a stage.
ALU split over 3 stages.
MEM takes 2 stages (pipelined D-cache).
Note: Some stages now overlap, and some instructions take extra stages.
UC Regents Fall 2010 © UCB, CS 250 L10: Design Verification
Superpipelining techniques ...
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove “rarely-used” forwarding networks that are on critical path.
Pipeline the wires of frequently used forwarding networks.
Creates stalls, affects CPI.
Also: Clocking tricks (example: negedge register file)
Add pipeline stages, reduce clock period:

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
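Plugging illustrative numbers into this performance equation shows the tradeoff: deepening the pipeline shrinks the cycle time but usually raises CPI, and the product decides the winner. All values below are made up for the sketch.

```python
def exec_time(instructions, cpi, cycle_ns):
    """Iron law: Seconds/Program = (Inst/Prog) * (Cycles/Inst) * (Sec/Cycle)."""
    return instructions * cpi * cycle_ns * 1e-9

insts = 1_000_000
base = exec_time(insts, cpi=1.2, cycle_ns=2.0)  # hypothetical 5-stage design
deep = exec_time(insts, cpi=1.5, cycle_ns=1.2)  # 8-stage: faster clock, worse CPI
assert deep < base                               # here the deeper pipe still wins
# 1e6 * 1.2 * 2.0 ns = 2.4 ms  vs  1e6 * 1.5 * 1.2 ns = 1.8 ms
```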
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process Vt was raised from [1] to limit standby power.
Q. Could adding pipeline stages hurt the CPI for an application? (ARM XScale: 8 stages)

A. Yes, due to these problems:

CPI Problem                          | Possible Solution
Taken branches cause longer stalls   | Branch prediction, loop unrolling
Cache misses take more clock cycles  | Larger caches, add prefetch opcodes to ISA
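These penalties can be folded into a simple stall-cycle model of CPI. The rates and penalties below are illustrative assumptions, not XScale measurements; the point is that a deeper pipe inflates both penalty terms.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, branch_penalty,
                  mem_freq, miss_rate, miss_penalty):
    """CPI = base + stall cycles/instruction from branches and cache misses."""
    branch_stalls = branch_freq * mispredict_rate * branch_penalty
    cache_stalls = mem_freq * miss_rate * miss_penalty
    return base_cpi + branch_stalls + cache_stalls

# Deeper pipeline: bigger branch penalty and (in cycles) bigger miss penalty.
shallow = effective_cpi(1.0, 0.2, 0.1, 3, 0.3, 0.05, 10)
deep    = effective_cpi(1.0, 0.2, 0.1, 6, 0.3, 0.05, 20)
assert deep > shallow   # same program, worse CPI on the deeper pipe
```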
Recall: Control hazards ...

[Figure: 5-stage datapath (IF, ID, EX, MEM, WB) with PC, +4 adder, and instruction memory/I-cache, plus a pipeline timing diagram for instructions I1-I6 over cycles t1-t8]

Sample program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: SUB R1,R9,R8
I3: AND R6,R5,R4

The EX stage computes whether the branch is taken. If the branch is taken, the instructions fetched behind it MUST NOT complete! We avoid stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage. If we add more early stages, we must stall.
Solution: Branch prediction ...

[Figure: the same 5-stage datapath, now with a branch predictor driving the PC, plus a pipeline timing diagram for instructions I1-I6 over cycles t1-t8]

For each fetched instruction, the branch predictor answers: Is it a control instruction? Taken or not taken? If taken, where to (what PC)?

The EX stage computes whether the branch is taken. If we predicted incorrectly, the wrong-path instructions MUST NOT complete! We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full! Dynamic predictors: a cache of branch history.
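A minimal "cache of branch history" is a table of 2-bit saturating counters indexed by low PC bits. This sketch is illustrative rather than any particular machine's predictor; the table size and initial state are assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits.

    Counter >= 2 predicts taken; each actual outcome nudges the counter,
    so a single anomalous branch outcome does not flip the prediction.
    """
    def __init__(self, entries=1024):          # entries must be a power of 2
        self.table = [1] * entries             # start weakly not-taken
        self.mask = entries - 1

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = TwoBitPredictor()
pc = 0x400
assert bp.predict(pc) is False    # counter starts at 1 (weakly not-taken)
bp.update(pc, True)
bp.update(pc, True)
assert bp.predict(pc) is True     # two taken outcomes flip the prediction
bp.update(pc, False)
assert bp.predict(pc) is True     # one not-taken does not flip it back
```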
Superscalar
Basic Idea: Improve CPI by issuing several instructions per cycle.
Superscalar R machine

[Figure: two parallel 5-stage pipelines (IF, ID, EX, MEM, WB), each with its own 32-bit ALU; a shared register file with four read ports (rs1-rs4/rd1-rd4) and two write ports (ws1/wd1 with WE1, ws2/wd2 with WE2); a 64-bit instruction-fetch path from instruction memory feeds both pipes through the PC/sequencer and the instruction issue logic]
Sustaining Dual Instruction Issue (no forwarding)

[Figure: the dual-pipeline superscalar datapath with an independent instruction stream flowing down both pipes, two instructions issued every cycle]

Sample instruction pairs, one per pipe each cycle:
ADD R8,R0,R0      ADD R11,R0,R0
ADD R9,R8,R7      ADD R12,R11,R10
ADD R15,R14,R13   ADD R18,R17,R16
ADD R21,R20,R19   ADD R24,R23,R22
ADD R27,R26,R25   ADD R30,R29,R28

It's rarely this good ...
Worst-Case Instruction Issue

[Figure: the dual-pipeline superscalar datapath with a fully dependent instruction stream; each cycle one pipe issues an ADD while the other issues a NOP]

Dependent stream:
ADD R8,R0,R0      NOP
ADD R9,R8,R0      NOP
ADD R10,R9,R0     NOP
ADD R11,R10,R0    NOP

Dependencies force "serialization".

We add 12 forwarding buses (not shown): 6 to each ID from the stages of both pipes.
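The serialization above comes from a RAW check in the issue logic. In this toy model (the instruction encoding and helper names are invented for illustration), a pair dual-issues only when the second instruction does not read the first's destination:

```python
# Instruction encoding: (dest, src1, src2) register numbers; dest None for NOPs.
def can_dual_issue(i1, i2):
    """True if i2 has no RAW dependence on i1, so both can issue this cycle."""
    d1 = i1[0]
    return d1 is None or d1 not in (i2[1], i2[2])

def issue(stream):
    """Greedy in-order dual issue; returns the instruction pairs per cycle."""
    cycles, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and can_dual_issue(stream[i], stream[i + 1]):
            cycles.append((stream[i], stream[i + 1]))
            i += 2
        else:
            cycles.append((stream[i], None))   # second slot gets a NOP
            i += 1
    return cycles

independent = [(8, 0, 0), (11, 0, 0), (9, 8, 7), (12, 11, 10)]
dependent   = [(8, 0, 0), (9, 8, 0), (10, 9, 0), (11, 10, 0)]
assert len(issue(independent)) == 2   # two instructions per cycle
assert len(issue(dependent)) == 4     # fully serialized, one per cycle
```

This only checks dependence within a candidate pair; a real issue unit must also consult the forwarding network for results still in flight from earlier cycles.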
Multi-Threading
Recall: Bypass network prevents stalls

[Figure: single 5-stage datapath (ID, EX, MEM, WB) with forwarding muxes selecting the ALU's A and B operands from the register file or from the EX, MEM, and WB stage results; data memory with Addr, Din, Dout, and WE; MemToReg mux in WB]

Instead of bypass: Interleave threads on the pipeline to prevent stalls ...
Krste Asanovic, November 10, 2004, 6.823 L18-3

Multithreading

How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe (F D X M W):

Cycle:               t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)    F  D  X  M  W
T2: ADD  r7, r1, r4      F  D  X  M  W
T3: XORI r5, r4, #12        F  D  X  M  W
T4: SW   0(r7), r5             F  D  X  M  W
T1: LW   r5, 12(r1)               F  D  X  M  W

The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.
Krste Asanovic, November 10, 2004, 6.823 L18-5

Simple Multithreaded Pipeline

[Figure: 5-stage multithreaded pipeline with four PCs and four GPR files (one per thread), a 2-bit thread select carried down the pipe, +1 PC increment, an I$ feeding the IR, the ALU (X, Y), and a D$]

Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage.

Introduced in 1964 by Seymour Cray: 4 CPUs, each run at 1/4 clock.

Many variants ...
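The barrel idea can be sketched as a round-robin scheduler: with at least as many threads as it takes to cover the pipeline depth, each instruction's writeback finishes before its thread's next instruction reaches decode, so no bypass network is needed. The model below is an illustrative toy, not the slide's datapath.

```python
NUM_STAGES = 5   # F, D, X, M, W

def schedule(num_threads, instrs_per_thread):
    """Round-robin issue: cycle c fetches from thread c % num_threads.

    Returns, per instruction, (thread, fetch_cycle, writeback_cycle)."""
    timeline = []
    for n in range(instrs_per_thread):
        for t in range(num_threads):
            fetch = n * num_threads + t
            timeline.append((t, fetch, fetch + NUM_STAGES - 1))
    return timeline

# With 4 threads on a 5-stage pipe, a thread's next instruction decodes
# (fetch + 1) strictly after its previous instruction's writeback cycle,
# so the regfile read always sees the written result.
tl = schedule(4, 2)
for (t, f, w) in tl:
    later = [x for x in tl if x[0] == t and x[1] > f]
    if later:
        assert later[0][1] + 1 > w   # next decode is after this writeback
```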
Upcoming: Project Proposals