Module 1: Introduction
CAD Design approaches
• The goal of each CAD design flow methodology is to increase the productivity of the design engineer
• Increasing the abstraction level of the design methodology and tools is one approach:
Abstraction                  Gates/eng./month   Design sizes
Describe-synthesize          1.5K - 6K          > 1M gates
Schematic capture-simulate   300 - 600          100K - 500K gates
SoC: System on a Chip
• The 2001 prediction: SoCs will be > 12M gates
• How do you create million-gate ASICs with the same amount of resources?
• ...while you:
  • Decrease development time
  • Increase functionality and performance
  • Keep small design teams
• Design methodology (design flow)
• Tools that support the methodology
• IP reuse (Intellectual Property)
ASIC and SoC Design flow
NAND gate: behavioral, transistor, layout
(Figure: Boolean equation, transistor, and mask views of the NAND gate)
O <= NOT ( A1 AND B1);
Adder: behavior, netlist, transistor, layout
(Figure: behavioral model and structural model)
Review
• To appreciate why we need high level design techniques
• We need to look over the past 30 years of chip development and their growing complexity
Ref: http://www.msm.cam.ac.uk/dmg/teaching/m101999/Ch8/index.htm
Memory Technology: DRAM Evolution
SoC: Intel Microprocessor History: 4004
• 1971 Intel 4004, 4-bit, 0.74 MHz, 16 pins, 2250 transistors
• Intel publicly introduced the world’s first single chip microprocessor: U. S. Patent #3,821,715.
• Intel took the integrated circuit one step further, by placing CPU, memory, I/O on a single chip
SoC: Intel Microprocessor History: 8080
• 1974 Intel 8080, 8-bit, 2 MHz, 40 pins, 4500 transistors
• Altair 8800 computer: Bill Gates & Paul Allen write their first Microsoft software product, BASIC
SoC: Intel Microprocessor History: 8088
• 1979 Intel 8088, 16-bit internal, 8-bit external, 4.77 MHz, 40 pins, 29000 transistors
• IBM PC/XT
  • 128 KB - 640 KB RAM
  • 360 KB, 5.25" floppy
  • 10 MB hard disk
SoC: Intel Processor History: Pentium Pro
• 1995 Intel Pentium Pro, 32-bit, 200 MHz internal clock, 66 MHz external, superpipelining, 16 KB L1 cache, 256 KB L2 cache, 387 pins, 5.5 million transistors
SoC: System on a chip (beyond Processor)
• The 2001 prediction: SoCs will be > 12M gates
Module 2: The VHDL Adder
SoC: System on a chip (beyond Processor)
• The 2001 prediction: SoCs will be > 12M gates
ASIC and SoC Design flow
Modelling types
• Behavioral model
  • Explicit definition of the mathematical relationship between input and output
  • No implementation information
  • It can exist at multiple levels of abstraction: dataflow, procedural, state machines, …
• Structural model
  • A representation of a system in terms of interconnections (netlist) of a set of defined components
  • Components can be described structurally or behaviorally
Adder: behavior, netlist, transistor, layout
(Figure: behavioral model and structural model)
Full Adder: alternative structural models
Are the behavioral models the same?
Why VHDL?
• The complexity and size of digital systems leads to:
  • Breadboards and prototypes which are too costly
  • Software and hardware interactions which are difficult to analyze without prototypes or simulations
  • Difficulty in communicating accurate design information
• We want to be able to target a design to a new technology while using the same descriptions, or to reuse parts of a design (IP)
Half Adder
• A half adder is a combinatorial circuit that performs the arithmetic sum of two bits.
• It has two inputs (X, Y) and two outputs (Sum, Carry), as shown.
Behavioral truth table:
X Y | Carry Sum
0 0 |   0    0
0 1 |   0    1
1 0 |   0    1
1 1 |   1    0
Carry <= X AND Y;
Sum <= X XOR Y;
Half Adder: behavioral properties
What are the behavioral properties of the half-adder circuit?
• Event property: the event on a, from 1 to 0, changes the output
• Propagation delay property: the output changes after a 5 ns propagation delay
• Concurrency property: both the XOR and AND gates compute new output values concurrently when an input changes state
Half Adder: Design Entity
• Design entity: a component of a system whose behavior is to be described and simulated
• Components of the description:
  • Entity declaration: the interface to the design. There can be only one interface declared.
  • Architecture construct: the internal behavior or structure of the design. There can be many different architectures.
  • Configuration: binds a component instance to an entity-architecture pair
Half Adder: Entity
ENTITY half_adder IS
  PORT (
    a, b: IN std_logic;
    sum, carry: OUT std_logic
  );
END half_adder;
• All keywords are in capitals by convention
• VHDL is case insensitive for keywords as well as identifiers
• In a PORT list the semicolon is a separator, not a terminator: there is no semicolon after the last port
• std_logic is a data type which denotes a logic bit (U, X, 0, 1, Z, W, L, H, -)
• BIT could be used instead of std_logic, but it only has the values (0, 1)
(Figure: half-adder block with inputs a, b and outputs Sum, Carry)
Half Adder: Architecture
ENTITY half_adder IS
  PORT (
    a, b: IN std_logic;
    Sum, Carry: OUT std_logic
  );
END half_adder;
ARCHITECTURE half_adder_arch_1 OF half_adder IS  -- must refer to entity name
BEGIN
  Sum <= a XOR b;
  Carry <= a AND b;
END half_adder_arch_1;
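The entity and architecture above can be exercised with a small self-checking test bench. The following is only a sketch: the test bench name tb_half_adder and the use of a PROCESS with ASSERT statements are assumptions for illustration and are not part of the slides.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY half_adder IS
  PORT (a, b: IN std_logic; Sum, Carry: OUT std_logic);
END half_adder;

ARCHITECTURE half_adder_arch_1 OF half_adder IS
BEGIN
  Sum   <= a XOR b;
  Carry <= a AND b;
END half_adder_arch_1;

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY tb_half_adder IS            -- hypothetical test bench name
END tb_half_adder;

ARCHITECTURE tb OF tb_half_adder IS
  SIGNAL a, b, Sum, Carry: std_logic;
BEGIN
  -- Direct instantiation of the entity-architecture pair
  UUT: ENTITY work.half_adder(half_adder_arch_1)
    PORT MAP (a => a, b => b, Sum => Sum, Carry => Carry);

  -- Walk through the four rows of the truth table
  stimulus: PROCESS
  BEGIN
    a <= '0'; b <= '0'; WAIT FOR 10 ns;
    ASSERT Sum = '0' AND Carry = '0' REPORT "0 + 0 failed" SEVERITY error;
    a <= '0'; b <= '1'; WAIT FOR 10 ns;
    ASSERT Sum = '1' AND Carry = '0' REPORT "0 + 1 failed" SEVERITY error;
    a <= '1'; b <= '0'; WAIT FOR 10 ns;
    ASSERT Sum = '1' AND Carry = '0' REPORT "1 + 0 failed" SEVERITY error;
    a <= '1'; b <= '1'; WAIT FOR 10 ns;
    ASSERT Sum = '0' AND Carry = '1' REPORT "1 + 1 failed" SEVERITY error;
    WAIT;                          -- stop the process
  END PROCESS;
END tb;
```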
Half Adder: Architecture with Delay
ENTITY half_adder IS
  PORT (
    a, b: IN std_logic;
    Sum, Carry: OUT std_logic
  );
END half_adder;
ARCHITECTURE half_adder_arch_2 OF half_adder IS
BEGIN
  Sum <= ( a XOR b ) after 5 ns;
  Carry <= ( a AND b ) after 5 ns;
END half_adder_arch_2;
Full Adder: Architecture
ENTITY full_adder IS
  PORT (
    x, y, z: IN std_logic;
    Sum, Carry: OUT std_logic
  );
END full_adder;
ARCHITECTURE full_adder_arch_1 OF full_adder IS
BEGIN
  Sum <= ( ( x XOR y ) XOR z );
  Carry <= ( ( x AND y ) OR ( z AND ( x XOR y ) ) );
END full_adder_arch_1;
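One way to gain confidence in the Sum and Carry equations is to check all eight input combinations against a count of the 1 bits. This is a sketch only: the test bench name tb_full_adder, the vector v, and the use of numeric_std are assumptions made for the illustration.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY full_adder IS
  PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
END full_adder;

ARCHITECTURE full_adder_arch_1 OF full_adder IS
BEGIN
  Sum   <= (x XOR y) XOR z;
  Carry <= (x AND y) OR (z AND (x XOR y));
END full_adder_arch_1;

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.numeric_std.ALL;

ENTITY tb_full_adder IS            -- hypothetical test bench name
END tb_full_adder;

ARCHITECTURE tb OF tb_full_adder IS
  SIGNAL v: std_logic_vector(2 DOWNTO 0) := "000";  -- v = (x, y, z)
  SIGNAL Sum, Carry: std_logic;
BEGIN
  UUT: ENTITY work.full_adder(full_adder_arch_1)
    PORT MAP (x => v(2), y => v(1), z => v(0), Sum => Sum, Carry => Carry);

  check: PROCESS
    VARIABLE ones: INTEGER;
    VARIABLE result: unsigned(1 DOWNTO 0);
  BEGIN
    FOR i IN 0 TO 7 LOOP
      v <= std_logic_vector(to_unsigned(i, 3));
      WAIT FOR 10 ns;
      -- Expected value: the number of inputs that are '1'
      ones := 0;
      FOR k IN 0 TO 2 LOOP
        IF v(k) = '1' THEN
          ones := ones + 1;
        END IF;
      END LOOP;
      result := Carry & Sum;       -- (Carry, Sum) read as a 2-bit number
      ASSERT to_integer(result) = ones
        REPORT "full adder mismatch" SEVERITY error;
    END LOOP;
    WAIT;
  END PROCESS;
END tb;
```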
Full Adder: Architecture with Delay
-- Note: this architecture names the inputs a, b, c_in; it compiles only if the
-- full_adder entity declares its input ports with these names (instead of x, y, z)
ARCHITECTURE full_adder_arch_2 OF full_adder IS
  SIGNAL s1, s2, s3: std_logic;
BEGIN
  s1 <= ( a XOR b ) after 15 ns;
  s2 <= ( c_in AND s1 ) after 5 ns;
  s3 <= ( a AND b ) after 5 ns;
  Sum <= ( s1 XOR c_in ) after 15 ns;
  Carry <= ( s2 OR s3 ) after 5 ns;
END full_adder_arch_2;
SIGNAL: Scheduled Event
• SIGNAL: like variables in a programming language such as C, signals can be assigned values, e.g. 0, 1
• However, SIGNALs also have an associated time value. A signal receives a value at a specific point in time and retains that value until it receives a new value at a future point in time (i.e. a scheduled event)
• For example: wave <= '0', '1' after 10 ns, '0' after 15 ns, '1' after 25 ns;
• The waveform of the signal is a sequence of values assigned to a signal over time
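The scheduled events in the wave example can be observed by sampling the signal between the transition times. A minimal sketch, assuming a hypothetical entity name wave_demo and sample times chosen for the illustration:

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY wave_demo IS                -- hypothetical name
END wave_demo;

ARCHITECTURE demo OF wave_demo IS
  SIGNAL wave: std_logic;
BEGIN
  -- All four events are scheduled once, when the simulation starts
  wave <= '0', '1' AFTER 10 ns, '0' AFTER 15 ns, '1' AFTER 25 ns;

  check: PROCESS
  BEGIN
    WAIT FOR 5 ns;                 -- t = 5 ns
    ASSERT wave = '0' REPORT "expected 0 at 5 ns" SEVERITY error;
    WAIT FOR 7 ns;                 -- t = 12 ns
    ASSERT wave = '1' REPORT "expected 1 at 12 ns" SEVERITY error;
    WAIT FOR 8 ns;                 -- t = 20 ns
    ASSERT wave = '0' REPORT "expected 0 at 20 ns" SEVERITY error;
    WAIT FOR 10 ns;                -- t = 30 ns
    ASSERT wave = '1' REPORT "expected 1 at 30 ns" SEVERITY error;
    WAIT;
  END PROCESS;
END demo;
```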
Hierarchical design: 2 bit adder
• The design interface to a two-bit adder is:
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY adder_bits_2 IS
  PORT (
    Carry_In: IN std_logic;
    a1, b1, a2, b2: IN std_logic;
    Sum1, Sum2: OUT std_logic;
    Carry_Out: OUT std_logic
  );
END adder_bits_2;
• Note that the ports are position dependent (Carry_In, a1, b1, a2, b2, Sum1, Sum2, Carry_Out)
Hierarchical designs: Ripple Structural Model
ARCHITECTURE ripple_2_arch OF adder_bits_2 IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL c1: std_logic;
BEGIN
  FA1: full_adder PORT MAP (Carry_In, a1, b1, Sum1, c1);
  FA2: full_adder PORT MAP (c1, a2, b2, Sum2, Carry_Out);
END ripple_2_arch;
Assignment #1
(1) Using full_adder_arch_2 with
  a <= '1', '0' after 20 ns;
  b <= '0', '1' after 10 ns, '0' after 15 ns, '1' after 25 ns;
  c_in <= '0', '1' after 10 ns;
hand draw the signal waveforms for a, b, c_in, s1, s2, s3, Sum, Carry
(2) Write the entity and architecture for the full subtractor
(3) Write the entity and architecture for a 4 bit subtractor
Note: this is a handwritten assignment, no programming, although you may want to type it in using a word processor.
Module 3: The VHDL N-bit Adder
Full Adder: Truth Table
Combinatorial Logic Operators
AND z <= x AND y;
NAND z <= NOT (x AND y);
NOR z <= NOT (x OR Y);
OR z <= x OR y;
NOT z <= NOT (x); z<= NOT x;
XOR z <= (x and NOT y) OR (NOT x AND y);
XNOR z <= (x and y) OR (NOT x AND NOT y);
Full Adder: Architecture
ENTITY full_adder IS                             -- Entity declaration
  PORT (x, y, z: IN std_logic;
        Sum, Carry: OUT std_logic);
END full_adder;                                  -- Optional entity END name
ARCHITECTURE full_adder_arch_1 OF full_adder IS  -- Architecture declaration
BEGIN
  Sum <= ( ( x XOR y ) XOR z );
  Carry <= ( ( x AND y ) OR ( z AND ( x XOR y ) ) );
END full_adder_arch_1;                           -- Optional architecture END name
Full Adder: Architecture with Delay
-- (inputs a, b, c_in: assumes the entity ports use these names)
ARCHITECTURE full_adder_arch_2 OF full_adder IS
  SIGNAL s1, s2, s3: std_logic;
BEGIN
  s1 <= ( a XOR b ) after 15 ns;
  s2 <= ( c_in AND s1 ) after 5 ns;
  s3 <= ( a AND b ) after 5 ns;
  Sum <= ( s1 XOR c_in ) after 15 ns;
  Carry <= ( s2 OR s3 ) after 5 ns;
END;
Signals (like wires) are not PORTs: they do not have a direction (i.e. IN, OUT)
Signal order:
ARCHITECTURE full_adder_arch_3 OF full_adder IS
  SIGNAL s1, s2, s3: std_logic;
BEGIN
  Carry <= ( s2 OR s3 ) after 5 ns;
  Sum <= ( s1 XOR c_in ) after 15 ns;
  s3 <= ( a AND b ) after 5 ns;
  s2 <= ( c_in AND s1 ) after 5 ns;
  s1 <= ( a XOR b ) after 15 ns;
END;
Does the statement order (full_adder_arch_3 versus full_adder_arch_2) matter? No, this is not C! The netlists have the same behavior and operate in parallel.
The Ripple-Carry n-Bit Binary Parallel Adder
Hierarchical design: 2-bit adder
• The design interface to a two-bit adder is:
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY adder_bits_2 IS
  PORT (Cin: IN std_logic;
        a0, b0, a1, b1: IN std_logic;
        S0, S1: OUT std_logic;
        Cout: OUT std_logic);
END;
• Note that the ports are position dependent (Cin, a0, b0, a1, b1, S0, S1, Cout)
Hierarchical design: Component Instance
ARCHITECTURE ripple_2_arch OF adder_bits_2 IS
  COMPONENT full_adder                              -- Component declaration
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t1: std_logic;
BEGIN
  FA1: full_adder PORT MAP (Cin, a0, b0, S0, t1);   -- Component instance #1 called FA1
  FA2: full_adder PORT MAP (t1, a1, b1, S1, Cout);  -- Component instance #2 called FA2
END;
Positional versus Named Association
• Positional association (must match the port order):
FA1: full_adder PORT MAP (Cin, a0, b0, S0, t1);
• Named association: port_name => signal
FA1: full_adder PORT MAP (x=>Cin, y=>a0, z=>b0, Sum=>S0, Carry=>t1);
• With named association the order does not matter:
FA1: full_adder PORT MAP (x=>Cin, y=>a0, z=>b0, Carry=>t1, Sum=>S0);
FA1: full_adder PORT MAP (Carry=>t1, Sum=>S0, y=>a0, z=>b0, x=>Cin);
Component by Named Association
ARCHITECTURE ripple_2_arch OF adder_bits_2 IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t1: std_logic; -- Temporary carry signal
BEGIN
  -- Named association (port_name => signal)
  FA1: full_adder PORT MAP (x=>Cin, y=>a0, z=>b0, Sum=>S0, Carry=>t1);
  -- Positional association
  FA2: full_adder PORT MAP (t1, a1, b1, S1, Cout);
END;
-- Comments start with a double dash
Using vectors: std_logic_vector
ENTITY adder_bits_2 IS
  PORT (Cin: IN std_logic;
        a0, b0, a1, b1: IN std_logic;
        S0, S1: OUT std_logic;
        Cout: OUT std_logic);
END;
• By using vectors, there is less typing of variables a0, a1, ...
ENTITY adder_bits_2 IS
  PORT (Cin: IN std_logic;
        a, b: IN std_logic_vector(1 downto 0);
        S: OUT std_logic_vector(1 downto 0);
        Cout: OUT std_logic);
END;
2-bit Ripple adder using std_logic_vector
ARCHITECTURE ripple_2_arch OF adder_bits_2 IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t1: std_logic; -- Temporary carry signal
BEGIN
  FA1: full_adder PORT MAP (Cin, a(0), b(0), S(0), t1);
  FA2: full_adder PORT MAP (t1, a(1), b(1), S(1), Cout);
END;
• Note that the signal usage is now different: a0 becomes a(0)
4-bit Ripple adder using std_logic_vector
ARCHITECTURE ripple_4_arch OF adder_bits_4 IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t: std_logic_vector(3 downto 1);
BEGIN
  FA1: full_adder PORT MAP (Cin, a(0), b(0), S(0), t(1));
  FA2: full_adder PORT MAP (t(1), a(1), b(1), S(1), t(2));
  FA3: full_adder PORT MAP (t(2), a(2), b(2), S(2), t(3));
  FA4: full_adder PORT MAP (t(3), a(3), b(3), S(3), Cout);
END;
• std_logic_vectors make it easier to replicate structures
For-Generate statement: first improvement
ARCHITECTURE ripple_4_arch OF adder_bits_4 IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t: std_logic_vector(3 downto 1);
  CONSTANT n: INTEGER := 4;  -- Constants never change value
BEGIN
  FA1: full_adder PORT MAP (Cin, a(0), b(0), S(0), t(1));
  -- FA2 and FA3 are replaced by the generate loop; the LABEL before the for is not optional
  FA_f: for i in 1 to n-2 generate
    FA_i: full_adder PORT MAP (t(i), a(i), b(i), S(i), t(i+1));
  end generate;
  FA4: full_adder PORT MAP (t(n-1), a(n-1), b(n-1), S(n-1), Cout);
END;
For-Generate statement: second improvement
ARCHITECTURE ripple_4_arch OF adder_bits_4 IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t: std_logic_vector(4 downto 0);  -- Keep track of vector sizes
  CONSTANT n: INTEGER := 4;
BEGIN
  t(0) <= Cin;
  Cout <= t(n);
  FA_f: for i in 0 to n-1 generate
    FA_i: full_adder PORT MAP (t(i), a(i), b(i), S(i), t(i+1));
  end generate;
END;
N-bit adder using generic
• By using generics, the design can be generalized
ENTITY adder_bits_4 IS
  PORT (Cin: IN std_logic;
        a, b: IN std_logic_vector(3 downto 0);
        S: OUT std_logic_vector(3 downto 0);
        Cout: OUT std_logic);
END;
ENTITY adder_bits_n IS
  GENERIC (n: INTEGER := 2);  -- Default case is 2
  PORT (Cin: IN std_logic;
        a, b: IN std_logic_vector(n-1 downto 0);
        S: OUT std_logic_vector(n-1 downto 0);
        Cout: OUT std_logic);
END;
For-Generate statement: third improvement
ARCHITECTURE ripple_n_arch OF adder_bits_n IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t: std_logic_vector(n downto 0);
BEGIN
  t(0) <= Cin;
  Cout <= t(n);
  FA: for i in 0 to n-1 generate
    FA_i: full_adder PORT MAP (t(i), a(i), b(i), S(i), t(i+1));
  end generate;
END;
Stimulus Only Test Bench Architecture
ARCHITECTURE tb OF tb_adder_4 IS
  COMPONENT adder_bits_n
    GENERIC (n: INTEGER := 2);
    PORT (Cin: IN std_logic;
          a, b: IN std_logic_vector(n-1 downto 0);
          S: OUT std_logic_vector(n-1 downto 0);
          Cout: OUT std_logic);
  END COMPONENT;
  SIGNAL x, y: std_logic_vector(3 downto 0);  -- Sum and Cout are ports of tb_adder_4
  SIGNAL c: std_logic;
BEGIN
  x <= "0000", "0001" after 50 ns, "0101" after 100 ns;
  y <= "0010", "0011" after 50 ns, "1010" after 100 ns;
  c <= '1', '0' after 50 ns;
  UUT_ADDER_4: adder_bits_n GENERIC MAP (4)   -- Override default
    PORT MAP (c, x, y, Sum, Cout);
END;
Stimulus Only Test Bench Entity
ENTITY tb_adder_4 IS
  PORT (Sum: OUT std_logic_vector(3 downto 0);
        Cout: OUT std_logic);
END;
The output of the test bench will be observed in the digital waveform display of the simulator.
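As an alternative to reading the waveform by eye, the same stimulus can be paired with ASSERT statements. The sketch below puts the full adder, the generic n-bit adder, and a checking test bench in one file; the test bench name tb_adder_check and the sample times are assumptions, not part of the slides.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY full_adder IS
  PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
END full_adder;
ARCHITECTURE full_adder_arch_1 OF full_adder IS
BEGIN
  Sum   <= (x XOR y) XOR z;
  Carry <= (x AND y) OR (z AND (x XOR y));
END full_adder_arch_1;

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY adder_bits_n IS
  GENERIC (n: INTEGER := 2);
  PORT (Cin: IN std_logic;
        a, b: IN std_logic_vector(n-1 DOWNTO 0);
        S: OUT std_logic_vector(n-1 DOWNTO 0);
        Cout: OUT std_logic);
END adder_bits_n;
ARCHITECTURE ripple_n_arch OF adder_bits_n IS
  COMPONENT full_adder
    PORT (x, y, z: IN std_logic; Sum, Carry: OUT std_logic);
  END COMPONENT;
  SIGNAL t: std_logic_vector(n DOWNTO 0);   -- carry chain, t(0) = Cin
BEGIN
  t(0) <= Cin;
  Cout <= t(n);
  FA: FOR i IN 0 TO n-1 GENERATE
    FA_i: full_adder PORT MAP (t(i), a(i), b(i), S(i), t(i+1));
  END GENERATE;
END ripple_n_arch;

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY tb_adder_check IS                    -- hypothetical test bench name
END tb_adder_check;
ARCHITECTURE tb OF tb_adder_check IS
  COMPONENT adder_bits_n
    GENERIC (n: INTEGER := 2);
    PORT (Cin: IN std_logic;
          a, b: IN std_logic_vector(n-1 DOWNTO 0);
          S: OUT std_logic_vector(n-1 DOWNTO 0);
          Cout: OUT std_logic);
  END COMPONENT;
  SIGNAL x, y, Sum: std_logic_vector(3 DOWNTO 0);
  SIGNAL c, Cout: std_logic;
BEGIN
  x <= "0000", "0001" AFTER 50 ns, "0101" AFTER 100 ns;
  y <= "0010", "0011" AFTER 50 ns, "1010" AFTER 100 ns;
  c <= '1', '0' AFTER 50 ns;
  UUT: adder_bits_n GENERIC MAP (4) PORT MAP (c, x, y, Sum, Cout);

  check: PROCESS
  BEGIN
    WAIT FOR 25 ns;   -- t = 25 ns: 0000 + 0010 + 1 = 0011
    ASSERT Sum = "0011" AND Cout = '0' REPORT "case 1 failed" SEVERITY error;
    WAIT FOR 50 ns;   -- t = 75 ns: 0001 + 0011 + 0 = 0100
    ASSERT Sum = "0100" AND Cout = '0' REPORT "case 2 failed" SEVERITY error;
    WAIT FOR 50 ns;   -- t = 125 ns: 0101 + 1010 + 0 = 1111
    ASSERT Sum = "1111" AND Cout = '0' REPORT "case 3 failed" SEVERITY error;
    WAIT;
  END PROCESS;
END tb;
```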
Module 4: Delay models & std_ulogic
Delta Delay
Delta Delay: Example using scheduling
Inertial Delay
Transport Delay
Inertial and Transport Delay
Inertial Delay is useful for modeling logic gates
Transport Delay is useful for modeling data buses, networks
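The difference between the two delay models shows up when a pulse is shorter than the delay. The sketch below assumes a 3 ns pulse and a 5 ns delay; the entity name delay_demo and the sample time are illustrative choices.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY delay_demo IS               -- hypothetical name
END delay_demo;

ARCHITECTURE demo OF delay_demo IS
  SIGNAL sig: std_logic := '0';
  SIGNAL a, b: std_logic;
BEGIN
  -- A 3 ns wide pulse, from 10 ns to 13 ns
  sig <= '0', '1' AFTER 10 ns, '0' AFTER 13 ns;

  a <= sig AFTER 5 ns;             -- inertial delay (the default)
  b <= TRANSPORT sig AFTER 5 ns;   -- transport delay

  check: PROCESS
  BEGIN
    WAIT FOR 16 ns;                -- the delayed pulse would span 15 ns to 18 ns
    -- Inertial: the pulse is shorter than the 5 ns delay, so it is rejected
    ASSERT a = '0' REPORT "inertial output should stay 0" SEVERITY error;
    -- Transport: every event is passed through, only shifted in time
    ASSERT b = '1' REPORT "transport output should be 1" SEVERITY error;
    WAIT;
  END PROCESS;
END demo;
```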
Combinatorial Logic Operators
Operator  VHDL                                      #Transistors
NOT       z <= NOT (x); z <= NOT x;                 2
AND       z <= x AND y;                             2+2i
NAND      z <= NOT (x AND y);                       2i
OR        z <= x OR y;                              2+2i
NOR       z <= NOT (x OR y);                        2i
XOR       z <= (x AND NOT y) OR (NOT x AND y);      10
          z <= (x AND y) NOR (x NOR y); --AOI
XNOR      z <= (x AND y) OR (NOT x AND NOT y);      10
          z <= (x NAND y) NAND (x OR y); --OAI
Footnote: (i = #inputs) We are only referring to CMOS static transistor ASIC gate designs. Exotic XOR designs can be done in 6 (J. W. Wang, IEEE J. Solid State Circuits, 29, July 1994)
Std_logic AND: Un-initialized value
AND 0 1 U
0 0 0 0
1 0 1 U
U 0 U U
OR 0 1 U
0 0 1 U
1 1 1 1
U U 1 U
0 AND <anything> is 0
0 NAND <anything> is 1
1 OR <anything> is 1
1 NOR <anything> is 0
NOT 0 1 U
1 0 U
Std_logic AND: X Forcing Unknown Value
AND 0 X 1 U
0 0 0 0 0
X 0 X X U
1 0 X 1 U
U 0 U U U
0 AND <anything> is 0
0 NAND <anything> is 1
OR 0 X 1 U
0 0 X 1 U
X X X 1 U
1 1 1 1 1
U U U 1 U
1 OR <anything> is 1
1 NOR <anything> is 0
NOT 0 X 1 U
1 X 0 U
Modeling logic gate values: std_ulogic
TYPE std_ulogic IS ( -- Unresolved LOGIC
  'U', -- Un-initialized
  'X', -- Forcing Unknown: i.e. combining 0 and 1
  '0', -- Forcing 0
  '1', -- Forcing 1
  'Z', -- High Impedance (Tri-State)
  'W', -- Weak Unknown: i.e. combining H and L
  'L', -- Weak 0
  'H', -- Weak 1
  '-'  -- Don't care
);
Example: multiple drivers
(Figure: two gates driving one output, and a rising transition signal passing through the levels L, W, H. At Vcc = 5.5 V and 25°C, a 1 is > 3.85 V, a 0 is < 1.65 V, and the 2.20 V gap between them is unknown.)
Multiple output drivers: Resolution Function
    U X 0 L Z W H 1 -
U   U U U U U U U U U
X   U X X X X X X X X
0   U X 0 0 0 0 0 X X
L   U X 0 L L W W 1 X
Z   U X 0 L Z W H 1 X
W   U X 0 W W W W 1 X
H   U X 0 W H W H 1 X
1   U X X 1 1 1 1 1 X
-   U X X X X X X X X
Suppose that the first gate outputs a 1 and the second gate outputs a 0; then the multi-driver output is X (forcing unknown value, by combining 1 and 0 together).
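A few entries of the resolution table can be reproduced by giving one std_logic signal two drivers. This is a sketch with hypothetical names (resolve_demo, y, w, p); each concurrent assignment stands in for one gate output.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY resolve_demo IS             -- hypothetical name
END resolve_demo;

ARCHITECTURE demo OF resolve_demo IS
  -- std_logic is a resolved type, so several drivers per signal are legal
  SIGNAL y, w, p: std_logic;
BEGIN
  y <= '1';                        -- first gate drives 1
  y <= '0';                        -- second gate drives 0

  w <= 'H';                        -- weak 1
  w <= 'L';                        -- weak 0

  p <= '0';                        -- forcing 0
  p <= 'H';                        -- weak 1

  check: PROCESS
  BEGIN
    WAIT FOR 1 ns;
    ASSERT y = 'X' REPORT "1 with 0 should resolve to X" SEVERITY error;
    ASSERT w = 'W' REPORT "H with L should resolve to W" SEVERITY error;
    ASSERT p = '0' REPORT "0 should pull a weak 1 down to 0" SEVERITY error;
    WAIT;
  END PROCESS;
END demo;
```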
Multiple output drivers: Resolution Function
    U X 0 L Z W H 1 -
U   U U U U U U U U U
X     X X X X X X X X
0       0 0 0 0 0 X X
L         L L W W 1 X
Z           Z W H 1 X
W             W W 1 X
H               H 1 X
1                 1 X
-                   X
• Note that the multi-driver resolution table is symmetrical
• Observe that 0 pulls down all weak signals to 0
• H <driving> L => W
Resolution Function: std_logic buffer gate
input:  U 0 L W X Z H 1 -
output: U 0 0 X X X 1 1 X
• 0 or L becomes 0
• H or 1 becomes 1
• Transition zone becomes X
(Figure: the std_ulogic levels 1, H, W, Z, L, 0 mapped onto the std_logic levels 1, X, 0)
Resolving input: std_logic AND GATE
Process each input as an unresolved to resolved buffer (std_ulogic to std_logic).
Then process the gate as a standard logic gate { 0, X, 1, U }.
For example, let's transform z <= 'W' AND '1';
z <= 'W' AND '1'; -- convert std_ulogic 'W' to std_logic 'X'
z <= 'X' AND '1'; -- now compute the std_logic AND
z <= 'X';
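The two-step evaluation can be checked in a simulator. The sketch below assumes that the To_UX01 function from the std_logic_1164 package plays the role of the buffer described above; the entity and variable names are hypothetical.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY and_weak_demo IS            -- hypothetical name
END and_weak_demo;

ARCHITECTURE demo OF and_weak_demo IS
BEGIN
  check: PROCESS
    VARIABLE w:   std_logic := 'W';
    VARIABLE one: std_logic := '1';
  BEGIN
    -- Step 1: the buffer maps the weak unknown W to X
    ASSERT To_UX01(w) = 'X' REPORT "W should map to X" SEVERITY error;
    -- Step 2: X AND 1 is X
    ASSERT (To_UX01(w) AND one) = 'X' REPORT "X AND 1 should be X" SEVERITY error;
    -- The AND operator applied directly to W and 1 gives the same result
    ASSERT (w AND one) = 'X' REPORT "W AND 1 should be X" SEVERITY error;
    WAIT;
  END PROCESS;
END demo;
```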
2-to-1 Multiplexor: with-select-when
(Figure: 2-to-1 multiplexor with inputs a (0) and b (1), select S, and output Y)
Structural (combinatorial logic):
Y <= sa OR sb;
sa <= a AND NOT s;
sb <= b AND s;
Behavioral:
WITH s SELECT
  Y <= a WHEN '0',
       b WHEN '1';   -- only values allowed in the WHEN; covers every choice only if s is of type BIT
or alternatively
WITH s SELECT
  Y <= a WHEN '0',
       b WHEN OTHERS;
4-to-1 Multiplexor: with-select-when
Structural (combinatorial logic):
Y <= sa OR sb OR sc OR sd;
sa <= a AND ( NOT s(1) AND NOT s(0) );
sb <= b AND ( NOT s(1) AND s(0) );
sc <= c AND ( s(1) AND NOT s(0) );
sd <= d AND ( s(1) AND s(0) );
Behavioral:
WITH s SELECT
  Y <= a WHEN "00",
       b WHEN "01",
       c WHEN "10",
       d WHEN OTHERS;
(Figure: 4-to-1 multiplexor with inputs a, b, c, d selected by S = 00, 01, 10, 11 and output Y)
As the complexity of the combinatorial logic grows, the SELECT statement simplifies logic design, but at a loss of structural information
Tri-State buffer
(Figure: buffer with input x, output y, and output enable oe)
ENTITY Buffer_Tri_State IS
  PORT (x: IN std_logic;
        y: OUT std_logic;
        oe: IN std_logic);
END;
ARCHITECTURE Buffer3 OF Buffer_Tri_State IS
BEGIN
  WITH oe SELECT
    y <= x WHEN '1',       -- Enabled: y <= x;
         'Z' WHEN OTHERS;  -- Disabled: output a tri-state
END;
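Tri-state buffers are typically used so that several sources can share one wire. The sketch below connects two buffers to a shared signal; the names tb_tri_state, bus_line, d0, d1, oe0, oe1 are hypothetical, and the 'X' WHEN OTHERS choice is an assumption added so that an unknown enable is visible on the output.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY Buffer_Tri_State IS
  PORT (x: IN std_logic; y: OUT std_logic; oe: IN std_logic);
END Buffer_Tri_State;
ARCHITECTURE Buffer3 OF Buffer_Tri_State IS
BEGIN
  WITH oe SELECT
    y <= x   WHEN '1',             -- enabled
         'Z' WHEN '0',             -- disabled: high impedance
         'X' WHEN OTHERS;          -- assumption: unknown enable gives unknown output
END Buffer3;

LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
ENTITY tb_tri_state IS             -- hypothetical test bench name
END tb_tri_state;
ARCHITECTURE tb OF tb_tri_state IS
  SIGNAL d0, d1, oe0, oe1: std_logic;
  SIGNAL bus_line: std_logic;      -- shared wire with two drivers
BEGIN
  B0: ENTITY work.Buffer_Tri_State(Buffer3)
    PORT MAP (x => d0, y => bus_line, oe => oe0);
  B1: ENTITY work.Buffer_Tri_State(Buffer3)
    PORT MAP (x => d1, y => bus_line, oe => oe1);

  check: PROCESS
  BEGIN
    d0 <= '1'; d1 <= '0';
    oe0 <= '1'; oe1 <= '0'; WAIT FOR 10 ns;   -- only B0 drives
    ASSERT bus_line = '1' REPORT "expected 1" SEVERITY error;
    oe0 <= '0'; oe1 <= '1'; WAIT FOR 10 ns;   -- only B1 drives
    ASSERT bus_line = '0' REPORT "expected 0" SEVERITY error;
    oe0 <= '0'; oe1 <= '0'; WAIT FOR 10 ns;   -- nobody drives
    ASSERT bus_line = 'Z' REPORT "expected Z" SEVERITY error;
    oe0 <= '1'; oe1 <= '1'; WAIT FOR 10 ns;   -- conflict: 1 against 0
    ASSERT bus_line = 'X' REPORT "expected X" SEVERITY error;
    WAIT;
  END PROCESS;
END tb;
```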
Assignment #2 (Part 1 of 3) Due Thurs, 9/14
1) Assume each gate has a 10 ns delay for the above circuit.
(a) Write the entity-architecture for an inertial model
(b) Given the following waveforms, draw R, S, Q, NQ (inertial)
  R <= '0', '1' after 25 ns, '0' after 30 ns;
  S <= '1', '0' after 20 ns, '1' after 35 ns, '0' after 50 ns;
(c) Write the entity-architecture for a transport model
(d) Given the waveforms in (b), draw R, S, Q, NQ (transport)
Assignment #2 (Part 2 of 3)
(Figure: two tri-state buffers connected together, with signals X, Y, F, a, b, G)
(2) Given the above two tri-state buffers connected together (assume a transport model of 5 ns per gate), draw X, Y, F, a, b, G for the following input waveforms:
  X <= '1', '0' after 10 ns, '1' after 20 ns, 'L' after 30 ns, '1' after 40 ns;
  Y <= '0', 'L' after 10 ns, 'W' after 20 ns, 'Z' after 30 ns, '0' after 40 ns;
  F <= '0', '1' after 10 ns, '0' after 50 ns;
Assignment #2 (Part 3 of 3)
3a) Write (no programming) an entity-architecture for a 1-bit ALU. The input will consist of x, y, Cin, f and the output will be S and Cout. Use as many sub-components as possible. The input function f will enable the following operations:
function f   ALU bit operation
000          S = 0, Cout = 0
001          S = x
010          S = y
011          S = x AND y
100          S = x OR y
101          S = x XOR y
110          (Cout, S) = x + y + Cin
111          (Cout, S) = full subtractor
3b) Calculate the number of transistors for the 1-bit ALU
3c) Write an entity-architecture for an N-bit ALU (for-generate)
(Figure: ALU block with inputs x, y, Cin, f and outputs S, Cout)
Module 5: AOIs, WITH-SELECT-WHEN, WHEN-ELSE
DeMorgan's laws: review
$\overline{X \cdot Y} = \overline{X} + \overline{Y}$
$\overline{X + Y} = \overline{X} \cdot \overline{Y}$
General rule:
1. Exchange the AND with OR
2. Invert the NOTs
CMOS logic gate: review
4 transistors 4 transistors 2 transistors
CMOS logic gate: layout sizes (1X output drive)
AOI: AND-OR-Invert gates
• Suppose you want to transform a circuit to all NANDs and NOTs
(Figure: the original circuit uses 16 transistors; the final NAND and NOT circuit uses 14 transistors)
AOI: AND-OR-Invert gates
• AOIs provide a way at the gate level to use fewer transistors than separate ANDs and a NOR
• ASIC design logic builds upon a standard logic cell library; therefore, do not optimize transistors, only logic gates
• For example, a 2-wide 2-input AOI will only use 8 transistors
• Whereas 2 ANDs (12 transistors) and 1 NOR (4 transistors) will use a total of 16 transistors {14 by DeMorgan's law}
• Although, there were no tricks to make AND gates better
AOI: AND-OR-Invert CMOS 2x2 example
• For example, 2-wide 2-input AOI (2x2 AOI):
O <= NOT((D1 AND C1) OR (B1 AND A1));
• This means AOIs use less chip area, less power, and less delay
AOI: other Standard Cell examples
AOI22 Cell: 2x2 AOI (8 transistors)
Y <= (A AND B) NOR (C AND D);
AOI23 Cell: 2x3 AOI (10 transistors)
Y <= (A AND B) NOR (C AND D AND E);
AOI21 Cell: 2x1 AOI (6 transistors)
Y <= (A AND B) NOR C;
Total transistors = 2 times # inputs
AOI: XOR implementation
The XOR is not as easy as it appears
Y <= (A AND NOT B) OR (NOT A AND B);
This design uses 8 + 8 + 6 = 22 transistors
Y <= NOT( A XNOR B );
Y <= NOT( (A AND B) OR (NOT B AND NOT A) );
This newer design uses 6 + 8 + 4 = 18 transistors
But wait, we can exploit the AOI22 structure: now we have 4 + 4 + 2 + 2 = 12 transistors
Finally, by applying DeMorgan's law:
Y <= NOT( (A AND B) OR (B NOR A) );
The total number of transistors is now 4 + 4 + 2 = 10
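The successive rewrites of the XOR should all compute the same function. A short exhaustive check over the four input combinations can confirm this; the entity name xor_forms and the variable names are hypothetical, and the check says nothing about the transistor counts.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;

ENTITY xor_forms IS                -- hypothetical name
END xor_forms;

ARCHITECTURE demo OF xor_forms IS
BEGIN
  check: PROCESS
    CONSTANT bits: std_logic_vector(0 TO 1) := "01";
    VARIABLE a, b, y1, y2, y3: std_logic;
  BEGIN
    FOR i IN 0 TO 1 LOOP
      FOR j IN 0 TO 1 LOOP
        a := bits(i);
        b := bits(j);
        y1 := (a AND NOT b) OR (NOT a AND b);          -- sum of products
        y2 := NOT ((a AND b) OR (NOT b AND NOT a));    -- inverted XNOR
        y3 := NOT ((a AND b) OR (b NOR a));            -- after DeMorgan's law
        ASSERT y1 = (a XOR b) REPORT "form 1 differs" SEVERITY error;
        ASSERT y2 = (a XOR b) REPORT "form 2 differs" SEVERITY error;
        ASSERT y3 = (a XOR b) REPORT "form 3 differs" SEVERITY error;
      END LOOP;
    END LOOP;
    WAIT;
  END PROCESS;
END demo;
```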
OAI: Or-And-Invert
• Or-And-Inverts are dual of the AOIs
with-select-when: 2-to-1 Multiplexor
(Figure: 2-to-1 multiplexor with inputs a (0) and b (1), select S, and output Y)
Structural (combinatorial logic), 6 + 6 + 6 + 2 = 20 transistors:
Y <= (a AND NOT s) OR (b AND s);
Behavioral:
WITH s SELECT
  Y <= a WHEN '0',
       b WHEN '1';   -- only values allowed in the WHEN
or alternatively
WITH s SELECT
  Y <= a WHEN '0',
       b WHEN OTHERS;
with-select-when: 2 to 4-line Decoder
SIGNAL S: std_logic_vector(1 downto 0);
SIGNAL Y: std_logic_vector(3 downto 0);
WITH S SELECT
  Y <= "1000" WHEN "11",
       "0100" WHEN "10",
       "0010" WHEN "01",
       "0001" WHEN OTHERS;
(Figure: gate-level decoder with inputs S1, S0 and outputs Y3, Y2, Y1, Y0; 6 + 8 + 8 + 10 = 32 transistors. Replace this with a NOR, then 26 total transistors.)
ROM: 4 byte Read Only Memory
(Figure: 4 byte by 8 bit ROM array; address inputs A1, A0 decoded to Y3, Y2, Y1, Y0; data outputs D7..D0 gated by OE)
ROM: 4 byte Read Only Memory
ENTITY rom_4x8 IS
  PORT (A: IN std_logic_vector(1 downto 0);
        OE: IN std_logic; -- Tri-State Output
        D: OUT std_logic_vector(7 downto 0));
END;
ARCHITECTURE rom_4x8_arch OF rom_4x8 IS
  SIGNAL ROMout: std_logic_vector(7 downto 0);
  -- The TriStateBuffer component declaration is not shown on the slide
BEGIN
  BufferOut: TriStateBuffer GENERIC MAP (8)
    PORT MAP (D, ROMout, OE);
  WITH A SELECT
    ROMout <= "01000001" WHEN "00",
              "11111011" WHEN "01",
              "00000110" WHEN "10",
              "00000000" WHEN OTHERS;  -- "11"
END;
when-else: 2-to-1 Multiplexor
(Figure: 2-to-1 multiplexor with inputs a (0) and b (1), select S, and output Y)
WITH s SELECT
  Y <= a WHEN '0',
       b WHEN '1';
or alternatively
WITH s SELECT
  Y <= a WHEN '0',
       b WHEN OTHERS;
With when-else:
Y <= a WHEN s = '0' ELSE
     b WHEN s = '1';
or alternatively
Y <= a WHEN s = '0' ELSE
     b;
WHEN-ELSE allows a condition as part of the WHEN, whereas WITH-SELECT allows only a value as part of the WHEN.
with-select-when: 4-to-1 Multiplexor
WITH s SELECT
  Y <= a WHEN "00",
       b WHEN "01",
       c WHEN "10",
       d WHEN OTHERS;
(Figure: 4-to-1 multiplexor with inputs a, b, c, d selected by S = 00, 01, 10, 11 and output Y)
Y <= a WHEN s = "00" ELSE
     b WHEN s = "01" ELSE
     c WHEN s = "10" ELSE
     d;
As long as each WHEN-ELSE condition is mutually exclusive, it is equivalent to the WITH-SELECT statement.
when-else: 2-level priority selector
Y <= a WHEN s(1) = '1' ELSE
     b WHEN s(0) = '1' ELSE
     '0';
WITH s SELECT
  Y <= a WHEN "11",
       a WHEN "10",
       b WHEN "01",
       '0' WHEN OTHERS;
(Figure: gate-level priority selector with inputs a, b, S1, S0 and output Y; 6 + 10 + 6 = 22 transistors)
WHEN-ELSE is useful for sequential logic or priority encoders
WITH-SELECT-WHEN is useful for parallel logic or multiplexors
when-else: 3-level priority selector
Y <= a WHEN s(2) = '1' ELSE
     b WHEN s(1) = '1' ELSE
     c WHEN s(0) = '1' ELSE
     '0';
WITH s SELECT
  Y <= a WHEN "111",
       a WHEN "110",
       a WHEN "101",
       a WHEN "100",
       b WHEN "011",
       b WHEN "010",
       c WHEN "001",
       '0' WHEN OTHERS;
(Figure: gate-level 3-level priority selector with inputs a, b, c, S2, S1, S0 and output Y, annotated with per-gate transistor counts)
when-else: 2-Bit Priority Encoder (~74LS148)
I1I0
I2A0
A1
GSI3
• Priority encoders are typically used as interrupt controllers
• The example below is based on the 74LS148
I3 I2 I1 I0 GS A1 A0
0 X X X 0 0 01 0 X X 0 0 11 1 0 X 0 1 01 1 1 0 0 1 11 1 1 1 1 1 1
I3 I2 I1 I0 GS A1 A0
0 X X X 0 0 01 0 X X 0 0 11 1 0 X 0 1 01 1 1 0 0 1 11 1 1 1 1 1 1
when-else: 2-Bit Priority Encoder (~74LS148)

  ENTITY PriEn2 IS PORT(
    I:  IN  std_logic_vector(3 downto 0);
    GS: OUT std_logic;
    A:  OUT std_logic_vector(1 downto 0)
  ); END;

  A <= "00" WHEN I(3) = '0' ELSE
       "01" WHEN I(2) = '0' ELSE
       "10" WHEN I(1) = '0' ELSE
       "11" WHEN I(0) = '0' ELSE
       "11";
when-else: 2-Bit Priority Encoder (~74LS148)

Structural model
  GS <= NOT( NOT(I(3)) OR NOT(I(2)) OR NOT(I(1)) OR NOT(I(0)) );

Behavioral model
  WITH I SELECT
    GS <= '1' WHEN "1111",
          '0' WHEN OTHERS;

Structural model
  GS <= I(3) AND I(2) AND I(1) AND I(0);
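Putting the pieces of the encoder together, a complete design unit might look like the sketch below. The architecture name PriEn2_arch and the testbench PriEn2_tb are assumptions for illustration; the logic follows the truth table above (active-low requests, I(3) highest priority).

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY PriEn2 IS PORT(
  I  : IN  std_logic_vector(3 downto 0);  -- active-low requests
  GS : OUT std_logic;
  A  : OUT std_logic_vector(1 downto 0)
); END ENTITY;

ARCHITECTURE PriEn2_arch OF PriEn2 IS
BEGIN
  -- The first matching condition wins, so I(3) has the highest priority
  A <= "00" WHEN I(3) = '0' ELSE
       "01" WHEN I(2) = '0' ELSE
       "10" WHEN I(1) = '0' ELSE
       "11";
  -- GS is 1 only when no request line is low
  GS <= I(3) AND I(2) AND I(1) AND I(0);
END ARCHITECTURE;

LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY PriEn2_tb IS
END ENTITY;

ARCHITECTURE sim OF PriEn2_tb IS
  SIGNAL I  : std_logic_vector(3 downto 0);
  SIGNAL GS : std_logic;
  SIGNAL A  : std_logic_vector(1 downto 0);
BEGIN
  dut : ENTITY work.PriEn2 PORT MAP (I => I, GS => GS, A => A);

  PROCESS BEGIN
    I <= "0111"; WAIT FOR 10 ns;   -- I3 low
    ASSERT A = "00" AND GS = '0' REPORT "row 1 failed" SEVERITY error;
    I <= "1101"; WAIT FOR 10 ns;   -- I1 low, I0 is a don't care
    ASSERT A = "10" AND GS = '0' REPORT "row 3 failed" SEVERITY error;
    I <= "1111"; WAIT FOR 10 ns;   -- no request
    ASSERT A = "11" AND GS = '1' REPORT "row 5 failed" SEVERITY error;
    WAIT;
  END PROCESS;
END ARCHITECTURE;
```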
Module 6: State machines
VHDL Component, Entity, and Architecture
[Figure: an Entity contains an Architecture. The Architecture contains concurrent statements: Boolean equations, With-Select-When, When-Else, component declarations, component instances, for-generate | if-generate, and other concurrent components]
VHDL Components

Component Declaration
  COMPONENT component_entity_name
    [ GENERIC ( { identifier: type [:= initial_value ]; } ); ]
    [ PORT ( { identifier: mode type; } ); ]
  END COMPONENT;

Component Instance
  identifier : component_entity_name
    [ GENERIC MAP ( identifier { ,identifier } ) ]
    [ PORT MAP ( identifier { ,identifier } ) ]
  ;

  [ Optional ]   { repeat }
  mode := IN | OUT | INOUT
  type := std_logic | std_logic_vector(n downto 0) | bit

Inside a GENERIC or PORT list, add ; only if another identifier follows
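As an illustration of the declaration and instance syntax, the sketch below builds a 2-input AND out of two NAND components. The names nand2, and2_struct and the ports x, y, z are assumptions chosen for the example, not names from the course material.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY nand2 IS PORT(
  x, y : IN  std_logic;
  z    : OUT std_logic
); END ENTITY;

ARCHITECTURE nand2_arch OF nand2 IS
BEGIN
  z <= x NAND y;
END ARCHITECTURE;

LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY and2_struct IS PORT(
  a, b : IN  std_logic;
  y    : OUT std_logic
); END ENTITY;

ARCHITECTURE structural OF and2_struct IS
  -- Component declaration: repeats the port list of the entity being used
  COMPONENT nand2
    PORT ( x, y : IN std_logic; z : OUT std_logic );
  END COMPONENT;
  SIGNAL n : std_logic;  -- internal wire between the two gates
BEGIN
  -- Component instances: positional association, then named association
  u1 : nand2 PORT MAP ( a, b, n );
  u2 : nand2 PORT MAP ( x => n, y => n, z => y );  -- NAND used as an inverter
END ARCHITECTURE;
```

Named association (x => n) is more verbose but does not break when the port order of the component changes.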
VHDL Concurrent Statements

Boolean Equations
  relation ::= relation LOGIC relation | NOT relation | ( relation )
  LOGIC ::= AND | OR | XOR | NAND | NOR | XNOR
  Example: y <= NOT ( NOT (a) AND NOT (b) );

Multiplexor case statement
  WITH select_signal SELECT
    signal <= signal_value1 WHEN select_compare1,
              • • •
              signal_valuen WHEN select_comparen;
  Example: 2 to 1 multiplexor
    WITH s SELECT y <= a WHEN '0', b WHEN OTHERS;

VHDL Concurrent Statements

Conditional signal assignment
  signal <= signal_value1 WHEN condition1 ELSE
            • • •
            signal_valuen WHEN conditionn ELSE
            signal_valuen+1;
  Example: Priority Encoder
    y <= a WHEN s='0' ELSE b;
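SR Flip-Flop (Latch)
[Figure: two cross-coupled gates with inputs R, S and outputs Q, NQ (Q inverted), drawn once with NAND gates and once with NOR gates]

NAND
  R S | Qn+1
  0 0 | U
  0 1 | 1
  1 0 | 0
  1 1 | Qn

  Q <= R NAND NQ;
  NQ <= S NAND Q;

NOR
  R S | Qn+1
  0 0 | Qn
  0 1 | 1
  1 0 | 0
  1 1 | U

  Q <= R NOR NQ;
  NQ <= S NOR Q;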
SR Flip-Flop (Latch): With Delay
[Figure: NAND latch in which each gate has a 5ns delay: Q(t + 5ns) depends on R(t) and NQ(t); NQ(t + 5ns) depends on S(t) and Q(t)]

NAND
  R S | Qn+1
  0 0 | U
  0 1 | 1
  1 0 | 0
  1 1 | Qn

Example: R <= '1', '0' after 10ns, '1' after 30ns; S <= '1';

  t  | 0  5ns 10ns 15ns 20ns 25ns 30ns 35ns 40ns
  R  | 1  1   0    0    0    0    1    1    1
  NQ | U  U   U    U    0    0    0    0    0
  Q  | U  U   U    1    1    1    1    1    1
  S  | 1  1   1    1    1    1    1    1    1
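The timing table can be reproduced in a simulator. The following is a small sketch, assuming a testbench entity named latch_delay_tb (a name invented here); the two gate equations and the stimulus come from the slide, and the AFTER 5 ns clauses model the gate delay.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY latch_delay_tb IS
END ENTITY;

ARCHITECTURE sim OF latch_delay_tb IS
  SIGNAL R, S, Q, NQ : std_logic;  -- all start as 'U'
BEGIN
  -- Cross-coupled NAND gates, each with a 5 ns delay
  Q  <= R NAND NQ AFTER 5 ns;
  NQ <= S NAND Q  AFTER 5 ns;

  -- Stimulus from the example
  R <= '1', '0' AFTER 10 ns, '1' AFTER 30 ns;
  S <= '1';

  check : PROCESS BEGIN
    WAIT FOR 16 ns;   -- just after 15 ns: R=0 has forced Q to 1
    ASSERT Q = '1' REPORT "Q should be 1 after 15 ns" SEVERITY error;
    WAIT FOR 5 ns;    -- just after 20 ns: NQ follows one gate delay later
    ASSERT NQ = '0' REPORT "NQ should be 0 after 20 ns" SEVERITY error;
    WAIT FOR 19 ns;   -- 40 ns: R is back to 1, the latch holds its state
    ASSERT Q = '1' AND NQ = '0' REPORT "latch did not hold" SEVERITY error;
    WAIT;
  END PROCESS;
END ARCHITECTURE;
```

Until R goes low at 10 ns both outputs stay 'U', because '1' NAND 'U' is still unknown; a '0' input is what forces a NAND output to a known '1'.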
Gated-Clock SR Flip-Flop (Latch Enable)
[Figure: NAND latch with inputs S, R gated by LE, outputs Q and NQ, plus PS and CLR inputs. Asynchronous: Preset and Clear. Synchronous: Set and Reset]

  Q <= (S NAND LE) NAND NQ;
  NQ <= (R NAND LE) NAND Q;

Suppose each gate was 5ns: how long does the clock have to be enabled to latch the data?
Answer: 15ns

Latches require that the data (i.e. S and R) be stable for the whole time the gated clock is enabled
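A structural sketch of the gated latch with one delay per gate is shown below. The generic gate_delay, the internal signal names and the testbench GC_Latch_tb are assumptions for illustration. The testbench enables LE for 15 ns, the figure given in the answer above, and checks that the set value is still held afterwards.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY GC_Latch IS
  GENERIC (gate_delay : TIME := 5 ns);
  PORT (S, R, LE : IN std_logic; Q, NQ : OUT std_logic);
END ENTITY;

ARCHITECTURE structural OF GC_Latch IS
  SIGNAL sn, rn    : std_logic;  -- outputs of the two input NAND gates
  SIGNAL q_i, nq_i : std_logic;  -- internal copies so the outputs can be fed back
BEGIN
  sn   <= S  NAND LE   AFTER gate_delay;
  rn   <= R  NAND LE   AFTER gate_delay;
  q_i  <= sn NAND nq_i AFTER gate_delay;
  nq_i <= rn NAND q_i  AFTER gate_delay;
  Q  <= q_i;
  NQ <= nq_i;
END ARCHITECTURE;

LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY GC_Latch_tb IS
END ENTITY;

ARCHITECTURE sim OF GC_Latch_tb IS
  SIGNAL S  : std_logic := '1';  -- request a set
  SIGNAL R  : std_logic := '0';
  SIGNAL LE : std_logic := '0';
  SIGNAL Q, NQ : std_logic;
BEGIN
  dut : ENTITY work.GC_Latch PORT MAP (S => S, R => R, LE => LE, Q => Q, NQ => NQ);

  -- Enable the latch for 15 ns (three gate delays)
  LE <= '0', '1' AFTER 20 ns, '0' AFTER 35 ns;

  PROCESS BEGIN
    WAIT FOR 60 ns;  -- well after LE has been removed
    ASSERT Q = '1' AND NQ = '0'
      REPORT "set value was not latched" SEVERITY error;
    WAIT;
  END PROCESS;
END ARCHITECTURE;
```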
Rising-Edge Flip-flop
Rising-Edge Flip-flop logic diagram
Synchronous Sequential Circuit
Abstraction: Finite State Machine
FSM Representations
Simple Design Example
State Encoding
Logic Implementations
FSM Observations
Coke Machine Example
Coke Machine State Diagram
Coke Machine Diagram II
Moore Machines
Mealy Machines
Module 7: Multicycle CPU
MIPS instructions

  ALU            alu $rd,$rs,$rt       $rd = $rs <alu> $rt
  ALUi           alui $rt,$rs,value    $rt = $rs <alu> value
  Data Transfer  lw $rt,offset($rs)    $rt = Mem[$rs + offset]
                 sw $rt,offset($rs)    Mem[$rs + offset] = $rt
  Branch         beq $rs,$rt,offset    $pc = ($rs == $rt)? (pc+4+offset):(pc+4);
  Jump           j address             pc = address
MIPS fixed sized instruction formats

R - Format:  op | rs | rt | rd | shamt | func
  ALU            alu $rd,$rs,$rt

I - Format:  op | rs | rt | value or offset
  ALUi           alui $rt,$rs,value
  Data Transfer  lw $rt,offset($rs)
                 sw $rt,offset($rs)
  Branch         beq $rs,$rt,offset

J - Format:  op | absolute address
  Jump           j address
Assembling Instructions
Suppose there are 32 registers, addu opcode=001001, addi op=001000

R - Format:  op | rs | rt | rd | shamt | func      ALU alu $rd,$rs,$rt
  0x00400020 addu $23, $0, $31
  001001:00000:11111:10111:00000:000000

I - Format:  op | rs | rt | value or offset         ALUi alui $rt,$rs,value
  0x00400024 addi $17, $0, 5
  001000:00000:10001:0000000000000101
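The field packing can be written out with the VHDL concatenation operator. This is a sketch under the slide's assumed opcodes (addu=001001, addi=001000); the testbench name assemble_tb and the helper function reg_field are hypothetical. Register 17 is 10001 in five bits, so the rt field of the addi word is 10001.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;
USE IEEE.numeric_std.all;

ENTITY assemble_tb IS
END ENTITY;

ARCHITECTURE sim OF assemble_tb IS
  -- Hypothetical helper: register number as a 5-bit field
  FUNCTION reg_field(n : NATURAL) RETURN std_logic_vector IS
  BEGIN
    RETURN std_logic_vector(to_unsigned(n, 5));
  END FUNCTION;

  -- Opcodes as assumed on the slide (not the real MIPS addu encoding)
  CONSTANT OP_ADDU : std_logic_vector(5 downto 0) := "001001";
  CONSTANT OP_ADDI : std_logic_vector(5 downto 0) := "001000";
  CONSTANT SHAMT0  : std_logic_vector(4 downto 0) := "00000";
  CONSTANT FUNC0   : std_logic_vector(5 downto 0) := "000000";

  SIGNAL addu_word, addi_word : std_logic_vector(31 downto 0);
BEGIN
  -- R-format: op rs rt rd shamt func     addu $23, $0, $31
  addu_word <= OP_ADDU & reg_field(0) & reg_field(31) & reg_field(23)
               & SHAMT0 & FUNC0;

  -- I-format: op rs rt value             addi $17, $0, 5
  addi_word <= OP_ADDI & reg_field(0) & reg_field(17)
               & std_logic_vector(to_signed(5, 16));

  PROCESS BEGIN
    WAIT FOR 1 ns;
    -- 001001 00000 11111 10111 00000 000000
    ASSERT addu_word = X"241FB800" REPORT "addu encoding" SEVERITY error;
    -- 001000 00000 10001 0000000000000101
    ASSERT addi_word = X"20110005" REPORT "addi encoding" SEVERITY error;
    WAIT;
  END PROCESS;
END ARCHITECTURE;
```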
MIPS instruction formats
[Figure: the five MIPS addressing modes, showing how the op, rs, rt, rd, funct, Immediate and Address fields select a register, a memory byte/halfword/word, or a new PC]

  1. Immediate addressing      Arithmetic:          addi $rt, $rs, value
  2. Register addressing       Arithmetic:          add $rd,$rs,$rt
  3. Base addressing           Data Transfer:       lw $rt,offset($rs)
                                                    sw $rt,offset($rs)
  4. PC-relative addressing    Conditional branch:  beq $rs,$rt,offset
  5. Pseudodirect addressing   Unconditional jump:  j address
MIPS registers and conventions
  Name      Number      Conventional usage
  $0        0           Constant 0
  $v0-$v1   2-3         Expression evaluation & function return
  $a0-$a3   4-7         Arguments 1 to 4
  $t0-$t9   8-15,24,25  Temporary (not preserved across call)
  $s0-$s7   16-23       Saved Temporary (preserved across call)
  $k0-$k1   26-27       Reserved for OS kernel
  $gp       28          Pointer to global area
  $sp       29          Stack pointer
  $fp       30          Frame pointer
  $ra       31          Return address (used by function call)
C function to MIPS Assembly Language

  int power_2(int y) { /* compute x=2^y; */
    register int x, i;
    x=1; i=0;
    while(i<y) { x=x*2; i=i+1; }
    return x;
  }

  Assembler .s              Comments
       addi $t0, $0, 1      # x=1;
       addu $t1, $0, $0     # i=0;
  w1:  bge  $t1,$a0,w2      # while(i<y) { /* bge= greater or equal */
       addu $t0, $t0, $t0   # x = x * 2; /* same as x=x+x; */
       addi $t1,$t1,1       # i = i + 1;
       beq  $0,$0,w1        # }
  w2:  addu $v0,$0,$t0      # return x;
       jr   $ra             # jump on register ( pc = ra; )

Exit condition of a while loop is: if ( i >= y ) then goto w2
Power_2.s: MIPS storage assignment

  .text
  0x00400020 addi $8, $0, 1    # addi $t0, $0, 1
  0x00400024 addu $9, $0, $0   # addu $t1, $0, $0
  0x00400028 bge  $9, $4, 2    # bge $t1, $a0, w2
  0x0040002c addu $8, $8, $8   # addu $t0, $t0, $t0
  0x00400030 addi $9, $9, 1    # addi $t1, $t1, 1
  0x00400034 beq  $0, $0, -3   # beq $0, $0, w1
  0x00400038 addu $2, $0, $8   # addu $v0, $0, $t0
  0x0040003c jr   $31          # jr $ra

Addresses are byte addresses, not word addresses.
Branch offset of bge: 2 words after pc fetch. After the bge fetch, pc is 0x00400030; plus 2 words is 0x00400038.
Machine Language Single Stepping
Assume power2(0); is called; then $a0=0 and $ra=700018
Values change after the instruction!

  $pc       $v0  $a0  $t0  $t1  $ra
            $2   $4   $8   $9   $31
  00400020  ?    0    ?    ?    700018   addi $t0, $0, 1
  00400024  ?    0    1    ?    700018   addu $t1, $0, $0
  00400028  ?    0    1    0    700018   bge $t1,$a0,w2
  00400038  ?    0    1    0    700018   addu $v0,$0,$t0
  0040003c  1    0    1    0    700018   jr $ra
  00700018  1    0    1    0    700018   …
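Von Neumann & Harvard CPU Architectures

Von Neumann architecture: one memory holds both instructions and data. Area efficient, but requires higher bus bandwidth because instructions and data must compete for memory.

Harvard architecture was coined to describe machines with separate instruction and data memories. Speed efficient: increased parallelism.

[Figure: Harvard machine with separate instructions and data memories connected to the ALU and I/O; Von Neumann machine with a single "instructions and data" memory connected to the ALU and I/O by a data bus and an address bus]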
Multi-cycle Processor Datapath
[Figure: multi-cycle datapath: PC, Memory (Address, Write data, MemData), Instruction register (fields [25-21], [20-16], [15-11], [15-0], [5-0]), Memory data register, Registers (Read register 1/2, Write register, Write data, Read data 1/2), Sign extend (16 to 32), Shift left 2, A, B, ALU (Zero, ALU result), ALU control and ALUOut, with multiplexors driven by the control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB and ALUOp]
Multi-cycle Datapath: with controller
[Figure: the multi-cycle datapath extended with a Control block. Control input: Op [5-0] = Instruction [31-26]. Control outputs: PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst. A three-input PCSource multiplexor (0, 1, 2) selects the ALU result, ALUOut, or the jump address [31-0] formed from PC [31-28] and Instruction [25-0] shifted left 2 (26 to 28 bits)]
Multi-cycle using Finite State Machine
[Figure: Finite State Machine (hardwired control). Combinational control logic takes as inputs the instruction register opcode field and the State register; its outputs are the datapath control outputs and the Next state, which is fed back into the State register]
Finite State Machine: program overview
[Figure: state diagram by clock step]
  T1: Fetch
  T2: Decode
  T3: Mem1 | Rformat1 | BEQ1 | JUMP1
  T4: LW2 | SW2 | Rformat1+1
  T5: LW2+1
The Four Stages of R-Format
• Fetch: Fetch the instruction from the Instruction Memory
• Decode: Registers Fetch and Instruction Decode
• Exec (ALU): ALU operates on the two register operands; Update PC
• Write (Reg): Write the ALU output back to the register file

  R-Format:  Cycle 1   Cycle 2   Cycle 3   Cycle 4
             Ifetch    Reg/Dec   Exec      Wr
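R-Format State Machine
[Figure: four-state loop Fetch → Decode → Exec → Write → Fetch, each transition taken on Clock=1; the states correspond to Cycle 1 (Ifetch), Cycle 2 (Reg/Dec), Cycle 3 (Exec) and Cycle 4 (Wr) of an R-Format instruction]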
The Five Stages of Load Instruction
• Fetch: Fetch the instruction from the Instruction Memory
• Decode: Registers Fetch and Instruction Decode
• Exec (Offset): Calculate the memory offset
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

  Load:  Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
         Ifetch    Reg/Dec   Exec      Mem       Wr
R-Format & I-Format State Machine
[Figure: state diagram]
  Fetch → Decode: Clock=1
  Decode → Exec ALU: Clock=1 AND R-Format=1 (need to check instruction format)
  Exec ALU → Write Reg → Fetch: Clock=1
  Decode → Exec Offset: Clock=1 AND I-Format=1 (need to check instruction format)
  Exec Offset → Mem Read: Clock=1 AND opcode=LW (need to check opcode)
  Mem Read → Write Reg → Fetch: Clock=1
Multi-Instruction sequence
[Figure: timing diagram against Clk, Cycle 1 to Cycle 10. Load: Ifetch, Reg, Exec, Mem, Wr; then Store: Ifetch, Reg, Exec, Mem; then R-type: Ifetch]
State machine stepping: T1 Fetch
(Done in parallel) IR ← MEMORY[PC] & PC ← PC + 4
[Figure: multi-cycle datapath with the PC and IR paths marked]

T1 Fetch: State machine
Instruction Fetch
  MemRead=1, MemWrite=0
  IorD=0 (MemAddr←PC)
  IRWrite=1 (IR←Mem[PC])
  ALUSrcA=0 (=PC)
  ALUSrcB=1 (=4)
  ALUOP=ADD (PC←4+PC)
  PCWrite=1, PCSource=0 (=ALU)
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram Start → Instruction Fetch → Decode → Exec → Write Reg; R-Format timing Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec, Cycle 4 Wr]
T2 Decode (read $rs and $rt and offset+pc)
A ← Reg[IR[25-21]] & B ← Reg[IR[20-16]] & ALUOut ← PC + (signext(IR[15-0]) << 2)
[Figure: multi-cycle datapath with the $rs, $rt, offset and PC paths marked]

T2 Decode State machine
Instr. Decode & Register Fetch
  MemRead=0, MemWrite=0
  IorD=X
  IRWrite=0
  ALUSrcA=0 (=PC)
  ALUSrcB=3 (=signext(IR[15-0])<<2)
  ALUOP=0 (=add)
  PCWrite=0, PCSource=X
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram Start → Fetch → Decode → Exec → Write Reg, R-Format Cycle 2 (Reg/Dec)]
T3 ExecALU (ALU instruction)
ALUOut ← A op(IR[31-26]) B
[Figure: multi-cycle datapath with the A, B, ALU and ALUOut paths marked; the ALU operation is selected by op(IR[31-26])]

T3 ExecALU State machine
R-Format Execution
  MemRead=0, MemWrite=0
  IorD=X
  IRWrite=0
  ALUSrcA=1 (=A =Reg[$rs])
  ALUSrcB=0 (=B =Reg[$rt])
  ALUOP=2 (=IR[28-26])
  PCWrite=0, PCSource=X
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram Start → Fetch → Decode → R-Format Execution → Write Reg, R-Format Cycle 3 (Exec)]
T4 WrReg (ALU instruction)
Reg[ IR[15-11] ] ← ALUOut
[Figure: multi-cycle datapath with the ALUOut to register file write path marked]

T4 WrReg State machine
R-Format Write Register
  MemRead=0, MemWrite=0
  IorD=X
  IRWrite=0
  ALUSrcA=X
  ALUSrcB=X
  ALUOP=X
  PCWrite=0, PCSource=X
  RegWrite=1 (Reg[$rd] ← ALUOut)
  MemtoReg=0 (=ALUOut)
  RegDst=1 (=$rd)
[Figure: state diagram Start → Fetch → Decode → Exec → R-Format Write Register, R-Format Cycle 4 (Wr)]
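Review Moore Machine
[Figure: Moore machine. Input IR[31-0] and the current state determine the Next State; the outputs, the Processor Control Lines (MemRead, MemWrite, IorD, ...), depend only on the state]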
Moore Output State Tables: O(State)

  State          T1       T2          T3-R        T4-R
  MemRead        1        0           0           0
  MemWrite       0        0           0           0
  MUX IorD       0 =PC    X           X           X
  IRWrite        1        0           0           0
  ALUOP          0 +      0 +         2 =op       X
  MUX ALUSrcA    0 =PC    0 =PC       1 =A =$rs   X
  MUX ALUSrcB    1 =4     3 =offset   0 =B =$rt   X
  PCWrite        1        0           0           0
  MUX PCSource   0 =ALU   X           X           X
  RegWrite       0        0           0           1
  MUX MemtoReg   X        X           X           0 =ALUOut
  MUX RegDst     X        X           X           1 =$rd
Review: The Five Stages of Load Instruction
• Fetch: Fetch the instruction from the Instruction Memory
• Decode: Registers Fetch and Instruction Decode
• Exec (Offset): Calculate the memory offset
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

  Load:  Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
         Ifetch    Reg/Dec   Exec      Mem       Wr
Review: R-Format & I-Format State Machine
[Figure: state diagram]
  Fetch → Decode: Clock=1
  Decode → Exec ALU: Clock=1 AND R-Format=1 (need to check instruction format)
  Exec ALU → Write Reg → Fetch: Clock=1
  Decode → Exec Offset: Clock=1 AND I-Format=1 (need to check instruction format)
  Exec Offset → Mem Read: Clock=1 AND opcode=LW (need to check opcode)
  Mem Read → Write Reg → Fetch: Clock=1
T3–I Mem1 (common to both load & store)
ALUOut ← A + sign_extend(IR[15-0])
[Figure: multi-cycle datapath with the A, sign-extended offset, ALU and ALUOut paths marked]

T3 Mem1 I-Format State Machine (= $rs + offset)
I-Format Execution: ALUOut = $rs + offset
  MemRead=0, MemWrite=0
  IorD=X
  IRWrite=0
  ALUOP=0 (+)
  ALUSrcA=1 (=A =Reg[$rs])
  ALUSrcB=2 (=signext(IR[15-0]))
  PCWrite=0, PCSource=X
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram. Decode → Exec ALU on Clock=1 AND R-Format=1; Decode → I-Format Execution on Clock=1 AND I-Format=1; I-Format Execution → Mem Read on Clock=1 AND opcode=LW; Mem Read → Write Reg]
T4–LW1: load instruction, read memory
MDR ← Memory[ALUOut]
[Figure: multi-cycle datapath with the ALUOut to memory address and memory to MDR paths marked]

T4 LW2 I-Format State Machine (= Mem[ALU])
I-Format Memory Read
  MemRead=1, MemWrite=0
  IorD=1
  IRWrite=0
  ALUOP=X
  ALUSrcA=X
  ALUSrcB=X
  PCWrite=0, PCSource=X
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram. Decode → Exec Offset on Clock=1 AND I-Format=1; Exec Offset → I-Format Memory Read on Clock=1 AND opcode=LW; then → Write Reg]
T5–LW2: load instruction, write to register
Reg[ IR[20-16] ] ← MDR
[Figure: multi-cycle datapath with the MDR to register file write path marked]

T5 LW2 I-Format State Machine ($rt = MDR)
I-Format Register Write
  MemRead=0, MemWrite=0
  IorD=X
  IRWrite=0
  ALUOP=X
  ALUSrcA=X
  ALUSrcB=X
  PCWrite=0, PCSource=X
  RegWrite=1, MemtoReg=1 (=MDR), RegDst=0 (=$rt)
[Figure: state diagram. Decode → Exec Offset on Clock=1 AND I-Format=1; Exec Offset → Mem Read on Clock=1 AND opcode=LW; Mem Read → I-Format Register Write]
T4–SW2: store instruction, write to memory
Memory[ ALUOut ] ← B
[Figure: multi-cycle datapath with the B to memory Write data and ALUOut to memory Address paths marked]

T4 SW2 I-Format State Machine (Mem[ALU] = $rt)
I-Format Memory Write
  MemRead=0, MemWrite=1
  IorD=1
  IRWrite=0
  ALUOP=X
  ALUSrcA=X
  ALUSrcB=X
  PCWrite=0, PCSource=X
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram. Decode → Exec Offset on Clock=1 AND I-Format=1; Exec Offset → I-Format Memory Write on Clock=1 AND opcode=SW. Store, not Load!]
T3 BEQ1 (Conditional branch instruction)
If (A - B == 0) { PC ← ALUOut; }
ALUOut = Address computed in T2!
[Figure: multi-cycle datapath with the A, B, ALU Zero output and ALUOut to PC paths marked]

T3 BEQ1 I-Format State Machine (branch target = pc + offset)
B-Format Execution
  MemRead=0, MemWrite=0
  IorD=X
  IRWrite=0
  ALUOP=1 (=subtract)
  ALUSrcA=1 (=A =Reg[$rs])
  ALUSrcB=0 (=B =Reg[$rt])
  PCWrite=0, PCWriteCond=1, PCSource=1 (=ALUOut)
  RegWrite=0, MemtoReg=X, RegDst=X
[Figure: state diagram. Decode → Exec ALU on Clock=1 AND R-Format=1; Decode → B-Format Execution on Clock=1 AND opcode=branch]
T3 Jump1 (Jump Address)
PC ← PC[31-28] || (IR[25-0] << 2)
[Figure: multi-cycle datapath with the jump address path into the PC marked]
Moore Output State Tables: O(State)

  State          T1       T2    T3-R        T4-R     T3-I        T4-LW    T5-LW    T4-SW
  MemRead        1        0     0           0        0           1        0        0
  MemWrite       0        0     0           0        0           0        0        1
  MUX IorD       0 =PC    X     X           X        X           1 =ALU   X        1 =ALU
  IRWrite        1        0     0           0        0           0        0        0
  ALUOP          0 =+     0     2 =op       X        0 =add      X        X        X
  MUX ALUSrcA    0 =PC    0     1 =A =$rs   X        1 =A =$rs   X        X        X
  MUX ALUSrcB    1 =4     3     0 =B =$rt   X        2 =sign     X        X        X
  PCWrite        1        0     0           0        0           0        0        0
  MUX PCSource   0 =ALU   X     X           X        X           X        X        X
  RegWrite       0        0     0           1        0           0        1        0
  MUX MemtoReg   X        X     X           0 =ALU   X           X        1 =MDR   X
  MUX RegDst     X        X     X           1 =$rd   X           X        0 =$rt   X
Multi-cycle: 5 execution steps
• T1 (a,lw,sw,beq,j) Instruction Fetch
• T2 (a,lw,sw,beq,j) Instruction Decode and Register Fetch
• T3 (a,lw,sw,beq,j) Execution, Memory Address Calculation, or Branch Completion
• T4 (a,lw,sw) Memory Access or R-type instruction completion
• T5 (lw) Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Multi-cycle Approach

T1 Instruction fetch (all instructions)
  IR = Memory[PC]
  PC = PC + 4

T2 Instruction decode/register fetch (all instructions)
  A = Reg[IR[25-21]]
  B = Reg[IR[20-16]]
  ALUOut = PC + (sign-extend(IR[15-0]) << 2)

T3 Execution, address computation, branch/jump completion
  R-type:            ALUOut = A op B
  Memory-reference:  ALUOut = A + sign-extend(IR[15-0])
  Branches:          if (A == B) then PC = ALUOut
  Jumps:             PC = PC[31-28] || (IR[25-0] << 2)

T4 Memory access or R-type completion
  R-type:            Reg[IR[15-11]] = ALUOut
  Memory-reference:  Load: MDR = Memory[ALUOut], or Store: Memory[ALUOut] = B

T5 Memory read completion
  Memory-reference:  Load: Reg[IR[20-16]] = MDR

All operations in each clock cycle Ti are done in parallel, not sequentially!
For example, in T1, IR = Memory[PC] and PC = PC + 4 are done simultaneously!
Between clock T2 and T3 the microcode sequencer will do a dispatch 1
Module 8: VHDL PROCESSES
2-to-1 Multiplexor: and Datapath multiplexor

[Figure: 2-to-1 multiplexor, input a on 0, input b on 1, select S, output Y]
behavioral
  WITH s SELECT
    Y <= a WHEN '0',
         b WHEN OTHERS;

[Figure: the same multiplexor with a, b and Y each n bits wide]
  WITH s SELECT
    Y <= a WHEN '0',
         b WHEN OTHERS;

Datapath is n bits wide
Where is the difference?
Generic 2-to-1 Datapath Multiplexor Entity
[Figure: n-bit 2-to-1 multiplexor, input a on 0, input b on 1, select S, output Y]

  LIBRARY IEEE;
  USE IEEE.std_logic_1164.all;
  USE IEEE.std_logic_arith.all;

  ENTITY Generic_Mux IS
    GENERIC (n: INTEGER);
    PORT (Y: OUT std_logic_vector(n-1 downto 0);
          a: IN std_logic_vector(n-1 downto 0);
          b: IN std_logic_vector(n-1 downto 0);
          S: IN std_logic_vector(0 downto 0)
    );
  END ENTITY;
Generic 2-to-1 Datapath Multiplexor Architecture

  ARCHITECTURE Generic_Mux_arch OF Generic_Mux IS
  BEGIN
    WITH S SELECT
      Y <= a WHEN "0",
           b WHEN OTHERS;
  END ARCHITECTURE;

Configurations are required for simulation

  CONFIGURATION Generic_Mux_cfg OF Generic_Mux IS
    FOR Generic_Mux_arch
    END FOR;
  END CONFIGURATION;
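The generic is bound when the multiplexor is instantiated. The sketch below, with a hypothetical top-level named mux_top and a default value added to the generic, uses a 32-bit instance as the IorD address multiplexor of the datapath. It assumes select value 0 routes input a, as in the multiplexor figure.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY Generic_Mux IS
  GENERIC (n : INTEGER := 8);  -- default width is an assumption of this sketch
  PORT (Y : OUT std_logic_vector(n-1 downto 0);
        a : IN  std_logic_vector(n-1 downto 0);
        b : IN  std_logic_vector(n-1 downto 0);
        S : IN  std_logic_vector(0 downto 0));
END ENTITY;

ARCHITECTURE Generic_Mux_arch OF Generic_Mux IS
BEGIN
  WITH S SELECT
    Y <= a WHEN "0",
         b WHEN OTHERS;
END ARCHITECTURE;

LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

-- Hypothetical top level: memory address multiplexor (0 = PC, 1 = ALUOut)
ENTITY mux_top IS PORT(
  pc, alu_out : IN  std_logic_vector(31 downto 0);
  IorD        : IN  std_logic;
  addr        : OUT std_logic_vector(31 downto 0)
); END ENTITY;

ARCHITECTURE structural OF mux_top IS
  COMPONENT Generic_Mux
    GENERIC (n : INTEGER);
    PORT (Y : OUT std_logic_vector(n-1 downto 0);
          a : IN  std_logic_vector(n-1 downto 0);
          b : IN  std_logic_vector(n-1 downto 0);
          S : IN  std_logic_vector(0 downto 0));
  END COMPONENT;
  SIGNAL sel : std_logic_vector(0 downto 0);
BEGIN
  sel(0) <= IorD;  -- the select port is a one-element vector

  addr_mux : Generic_Mux
    GENERIC MAP (n => 32)
    PORT MAP (Y => addr, a => pc, b => alu_out, S => sel);
END ARCHITECTURE;
```

The same entity can serve every two-input multiplexor in the datapath; only the GENERIC MAP changes.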
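Structural SR Flip-Flop (Latch) NAND
[Figure: cross-coupled NAND latch with inputs R, S and outputs Q, NQ]

  R S | Qn+1
  0 0 | U
  0 1 | 1
  1 0 | 0
  1 1 | Qn

  ENTITY Latch IS
    -- BUFFER lets the architecture read Q and NQ back;
    -- an OUT port cannot be read before VHDL-2008
    PORT(R, S: IN std_logic; Q, NQ: BUFFER std_logic);
  END ENTITY;

  ARCHITECTURE latch_arch OF Latch IS
  BEGIN
    Q <= R NAND NQ;
    NQ <= S NAND Q;
  END ARCHITECTURE;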
Inferring Behavioral Latches: Asynchronous
[Figure: cross-coupled NAND latch with inputs R, S and outputs Q, NQ]

NAND
  R S | Qn+1
  0 0 | U
  0 1 | 1
  1 0 | 0
  1 1 | Qn

  ARCHITECTURE Latch2_arch OF Latch IS
  BEGIN
    PROCESS (R, S) BEGIN
      IF R = '0' THEN
        Q <= '1'; NQ <= '0';
      ELSIF S = '0' THEN
        Q <= '0'; NQ <= '1';
      END IF;
    END PROCESS;
  END ARCHITECTURE;

Sensitivity list of signals: every time a change of state or event occurs on these signals, this process will be called
Statements inside the process are sequential statements
Gated-Clock SR Flip-Flop (Latch Enable)
[Figure: gated NAND latch with inputs S, R, LE and outputs Q, NQ]

  ARCHITECTURE Latch_arch OF GC_Latch IS BEGIN
    PROCESS (R, S, LE) BEGIN
      IF LE = '1' THEN
        IF R = '0' THEN
          Q <= '1'; NQ <= '0';
        ELSIF S = '0' THEN
          Q <= '0'; NQ <= '1';
        END IF;
      END IF;
    END PROCESS;
  END ARCHITECTURE;
Inferring D-Flip Flops: Synchronous

  ARCHITECTURE Dff_arch OF Dff IS
  BEGIN
    PROCESS (Clock) BEGIN
      IF Clock'EVENT AND Clock = '1' THEN
        Q <= D;
      END IF;
    END PROCESS;
  END ARCHITECTURE;

Sensitivity lists contain signals used in conditionals (i.e. IF)
Notice the sensitivity list does not contain D, i.e. it is not PROCESS(Clock, D)
Clock'EVENT is what distinguishes a D flip-flop from a latch
Inferring D-Flip Flops: rising_edge

  ARCHITECTURE Dff_arch OF Dff IS BEGIN
    PROCESS (Clock) BEGIN
      IF Clock'EVENT AND Clock = '1' THEN
        Q <= D;
      END IF;
    END PROCESS;
  END ARCHITECTURE;

An alternate and more readable way is to use the rising_edge function

  ARCHITECTURE dff_arch OF dff IS BEGIN
    PROCESS (Clock) BEGIN
      IF rising_edge(Clock) THEN
        Q <= D;
      END IF;
    END PROCESS;
  END ARCHITECTURE;
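To make the flip-flop versus latch distinction concrete, the sketch below puts both processes side by side. The entity name dff_vs_latch and the output names Q_ff and Q_latch are assumptions for illustration. The only differences are the edge test and the presence of D in the sensitivity list.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY dff_vs_latch IS PORT(
  Clock, D      : IN  std_logic;
  Q_ff, Q_latch : OUT std_logic
); END ENTITY;

ARCHITECTURE behavioral OF dff_vs_latch IS
BEGIN
  -- Edge triggered: Q_ff samples D only at the rising edge of Clock
  ff : PROCESS (Clock) BEGIN
    IF rising_edge(Clock) THEN
      Q_ff <= D;
    END IF;
  END PROCESS;

  -- Level sensitive: Q_latch follows D for as long as Clock is high,
  -- so D must appear in the sensitivity list
  latch : PROCESS (Clock, D) BEGIN
    IF Clock = '1' THEN
      Q_latch <= D;
    END IF;
  END PROCESS;
END ARCHITECTURE;
```

If D changes while Clock is high, Q_latch changes with it but Q_ff keeps the value captured at the edge.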
Inferring D-Flip Flops: Asynchronous Reset

  ARCHITECTURE dff_reset_arch OF dff_reset IS BEGIN
    PROCESS (Clock, Reset) BEGIN
      IF Reset = '1' THEN               -- Asynchronous Reset
        Q <= '0';
      ELSIF rising_edge(Clock) THEN     -- Synchronous
        Q <= D;
      END IF;
    END PROCESS;
  END ARCHITECTURE;
Inferring D-Flip Flops: Synchronous Reset

Synchronous Reset, Synchronous FF
  PROCESS (Clock, Reset) BEGIN
    IF rising_edge(Clock) THEN
      IF Reset = '1' THEN
        Q <= '0';
      ELSE
        Q <= D;
      END IF;
    END IF;
  END PROCESS;

Asynchronous Reset, Synchronous FF
  PROCESS (Clock, Reset) BEGIN
    IF Reset = '1' THEN
      Q <= '0';
    ELSIF rising_edge(Clock) THEN
      Q <= D;
    END IF;
  END PROCESS;
D-Flip Flops: Asynchronous Reset & Preset

  PROCESS (Clock, Reset, Preset) BEGIN
    IF Reset = '1' THEN               -- highest priority
      Q <= '0';
    ELSIF Preset = '1' THEN
      Q <= '1';
    ELSIF rising_edge(Clock) THEN
      Q <= D;
    END IF;
  END PROCESS;
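A complete, simulatable version of this flip-flop might look as follows. The entity name dff_rp and the testbench dff_rp_tb are assumptions for the sketch; it takes Preset to drive Q to '1'. The testbench checks the preset, the priority of Reset over Preset, and a normal clocked load.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY dff_rp IS PORT(
  Clock, Reset, Preset, D : IN  std_logic;
  Q                       : OUT std_logic
); END ENTITY;

ARCHITECTURE behavioral OF dff_rp IS
BEGIN
  PROCESS (Clock, Reset, Preset) BEGIN
    IF Reset = '1' THEN              -- highest priority
      Q <= '0';
    ELSIF Preset = '1' THEN
      Q <= '1';
    ELSIF rising_edge(Clock) THEN
      Q <= D;
    END IF;
  END PROCESS;
END ARCHITECTURE;

LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY dff_rp_tb IS
END ENTITY;

ARCHITECTURE sim OF dff_rp_tb IS
  SIGNAL Clock, Reset, Preset, D : std_logic := '0';
  SIGNAL Q : std_logic;
BEGIN
  dut : ENTITY work.dff_rp
    PORT MAP (Clock => Clock, Reset => Reset, Preset => Preset, D => D, Q => Q);

  PROCESS BEGIN
    Preset <= '1'; WAIT FOR 5 ns;
    ASSERT Q = '1' REPORT "preset should set Q" SEVERITY error;
    Reset <= '1'; WAIT FOR 5 ns;      -- both asserted: Reset wins
    ASSERT Q = '0' REPORT "reset should override preset" SEVERITY error;
    Reset <= '0'; Preset <= '0'; D <= '1'; WAIT FOR 5 ns;
    ASSERT Q = '0' REPORT "Q must not change without a clock edge" SEVERITY error;
    Clock <= '1'; WAIT FOR 5 ns;      -- rising edge loads D
    ASSERT Q = '1' REPORT "clock edge should load D" SEVERITY error;
    WAIT;
  END PROCESS;
END ARCHITECTURE;
```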
RTL Multi-cycle Datapath: with controller
Register Transfer Level (RTL) View
[Figure: the multi-cycle datapath with its Control block. Control input: Op [5-0] = Instruction [31-26]. Control outputs: PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst. The PCSource multiplexor (0, 1, 2) selects the ALU result, ALUOut, or the jump address [31-0] formed from PC [31-28] and Instruction [25-0] shifted left 2]
CPU controller: Finite State Machine
[Figure: FSM block]
  Inputs:  Clock, Reset, IRopcode, ALUzero
  Outputs: PCWrite, PCWriteCond (combined into PCWriteEnable), PCSourceMux, MemRead, MemWrite, IorDMux, IRWrite, RegWrite, RegDstMux, MemtoReg, ALUOp, ALUSrcAMux, ALUSrcBMux
CPU Controller: Entity

  ENTITY cpu_controller is PORT(
    CLK, RST          :IN std_logic;
    IRopcode          :IN std_logic_vector(5 downto 0);
    ALUzero           :IN std_logic;
    PCWriteEnable     :OUT std_logic;
    PCSourceMux       :OUT std_logic_vector(1 downto 0);
    MemRead, MemWrite :OUT std_logic;
    IorDMux           :OUT std_logic;
    IRWrite           :OUT std_logic;
    RegWrite          :OUT std_logic;
    RegDstMux         :OUT std_logic;
    MemtoRegMux       :OUT std_logic;
    ALUOp             :OUT std_logic_vector(2 downto 0);
    ALUSrcAMux        :OUT std_logic;
    ALUSrcBMux        :OUT std_logic_vector(1 downto 0)
  ); END ENTITY;
CPU controller: R-Format State Machine
[Figure: four-state loop Fetch → Decode → ExecRtype → WriteRtype → Fetch, each transition taken on Clock=1; R-Format timing Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec, Cycle 4 Wr]
CPU Controller: Current State Process

  ARCHITECTURE cpu_controller_arch OF cpu_controller IS
    TYPE CPUStates IS (Fetch, Decode, ExecRtype, WriteRtype);
    SIGNAL State, NextState :CPUStates;
  BEGIN
    PROCESS (State) BEGIN
      CASE State IS
        WHEN Fetch      => NextState <= Decode;
        WHEN Decode     => NextState <= ExecRtype;
        WHEN ExecRtype  => NextState <= WriteRtype;
        WHEN WriteRtype => NextState <= Fetch;
        WHEN OTHERS     => NextState <= Fetch;
      END CASE;
    END PROCESS;
    • • •
CPU controller: NextState Clock Process

    PROCESS (CLK, RST) BEGIN
      IF RST='1' THEN                -- Asynchronous Reset
        State <= Fetch;
      ELSIF rising_edge(CLK) THEN
        State <= NextState;
      END IF;
    END PROCESS;
  END ARCHITECTURE;
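The two processes above cover only the R-format path. A possible extension that dispatches on the opcode after Decode is sketched below. Everything beyond the four original states is an assumption: the entity name cpu_fsm_sketch, the added state names, the reduced output list, and the opcode constants (standard MIPS values for lw, sw, beq and j; any other opcode is treated as an ALU instruction).

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY cpu_fsm_sketch IS PORT(
  CLK, RST : IN  std_logic;
  IRopcode : IN  std_logic_vector(5 downto 0);
  MemRead, MemWrite, IRWrite, RegWrite : OUT std_logic
); END ENTITY;

ARCHITECTURE sketch OF cpu_fsm_sketch IS
  TYPE CPUStates IS (Fetch, Decode, ExecRtype, WriteRtype,
                     MemAddr, MemReadLW, WriteLW, MemWriteSW, Branch, Jump);
  SIGNAL State, NextState : CPUStates;
  -- Assumed opcode values (standard MIPS encodings)
  CONSTANT OP_LW  : std_logic_vector(5 downto 0) := "100011";
  CONSTANT OP_SW  : std_logic_vector(5 downto 0) := "101011";
  CONSTANT OP_BEQ : std_logic_vector(5 downto 0) := "000100";
  CONSTANT OP_J   : std_logic_vector(5 downto 0) := "000010";
BEGIN
  -- Next-state logic: NextState is assigned on every path, so no latch is inferred
  PROCESS (State, IRopcode) BEGIN
    CASE State IS
      WHEN Fetch  => NextState <= Decode;
      WHEN Decode =>                      -- dispatch between T2 and T3
        IF IRopcode = OP_LW OR IRopcode = OP_SW THEN
          NextState <= MemAddr;
        ELSIF IRopcode = OP_BEQ THEN
          NextState <= Branch;
        ELSIF IRopcode = OP_J THEN
          NextState <= Jump;
        ELSE
          NextState <= ExecRtype;         -- everything else: ALU instruction
        END IF;
      WHEN ExecRtype => NextState <= WriteRtype;
      WHEN MemAddr =>
        IF IRopcode = OP_LW THEN
          NextState <= MemReadLW;
        ELSE
          NextState <= MemWriteSW;
        END IF;
      WHEN MemReadLW => NextState <= WriteLW;
      WHEN OTHERS    => NextState <= Fetch;  -- last step of every instruction
    END CASE;
  END PROCESS;

  -- State register with asynchronous reset
  PROCESS (CLK, RST) BEGIN
    IF RST = '1' THEN
      State <= Fetch;
    ELSIF rising_edge(CLK) THEN
      State <= NextState;
    END IF;
  END PROCESS;

  -- Moore outputs: functions of the state only
  MemRead  <= '1' WHEN State = Fetch OR State = MemReadLW ELSE '0';
  MemWrite <= '1' WHEN State = MemWriteSW ELSE '0';
  IRWrite  <= '1' WHEN State = Fetch ELSE '0';
  RegWrite <= '1' WHEN State = WriteRtype OR State = WriteLW ELSE '0';
END ARCHITECTURE;
```

With this dispatch, branches and jumps take 3 cycles, stores and ALU instructions 4, and loads 5.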
T1 Fetch: State machine
Instruction Fetch
  MemRead=1, MemWrite=0
  IorD=0 (MemAddr←PC)
  IRWrite=1 (IR←Mem[PC])
  ALUOP=ADD (PC←4+PC)
  ALUSrcA=0 (=PC)
  ALUSrcB=1 (=4)
  PCWrite=1, PCSource=0 (=ALU)
  RegWrite=0, RegDst=X, MemtoReg=X
[Figure: state diagram Start → Instruction Fetch → Decode → Exec → Write Reg; R-Format timing Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec, Cycle 4 Wr]
T1 Fetch: VHDL with Moore Output States

  PROCESS (State) BEGIN
    CASE State IS
      WHEN Fetch =>
        NextState     <= Decode;
        MemRead       <= '1';
        MemWrite      <= '0';
        IorDMux       <= '0';     --PC
        IRWrite       <= '1';
        ALUOp         <= "010";   --add
        ALUSrcAMux    <= '0';     --PC
        ALUSrcBMux    <= "01";    --4
        PCWriteEnable <= '1';
        PCSourceMux   <= "00";    --ALU (not ALUOut)
        RegWrite      <= '0';
        RegDstMux     <= '-';
        MemtoRegMux   <= '-';

'-' is the std_logic value for Don't Care
VHDL inferred Latches: WARNING
In a VHDL case statement, the same signal must be assigned in every case. Otherwise that signal will be inferred as a latch and not as combinatorial logic!
For example, suppose the don't-care assignment to RegDstMux is removed from the Decode state because the value is not used there. RegDstMux will then be inferred as a latch, not as logic, even though it is set in the WriteRtype state.
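One common way to avoid the problem is to give every output a default value at the top of the process, so that each case only lists the signals it changes. The sketch below is a reduced illustration, not the course controller: the entity name latch_free_outputs and the 2-bit State encoding are assumptions, with "11" standing in for WriteRtype.

```vhdl
LIBRARY IEEE;
USE IEEE.std_logic_1164.all;

ENTITY latch_free_outputs IS PORT(
  State               : IN  std_logic_vector(1 downto 0);
  RegWrite, RegDstMux : OUT std_logic
); END ENTITY;

ARCHITECTURE rtl OF latch_free_outputs IS
BEGIN
  PROCESS (State) BEGIN
    -- Defaults: every output is assigned on every pass through the process,
    -- so the synthesizer sees pure combinational logic
    RegWrite  <= '0';
    RegDstMux <= '-';   -- don't care unless a state says otherwise

    CASE State IS
      WHEN "11" =>      -- stands in for WriteRtype
        RegWrite  <= '1';
        RegDstMux <= '1';
      WHEN OTHERS =>
        NULL;           -- keep the defaults
    END CASE;
  END PROCESS;
END ARCHITECTURE;
```

Because the last assignment in a process wins, a case branch overrides the default only for the signals it names.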
Assignment #3: CPU Architecture design (1/3)
Cyber Dynamics Corporation (18144 El Camino Real, S’Vale California) needs the following embedded model 101 microprocessor designed by Thursday October 5, 2000 with the following specifications:
• 16 bit instruction memory using ROM
• 8 bit data memory using RAM
• There are eight 8-bit registers
• The instruction set is as follows
• All arithmetic and logical instructions set a one-bit Zero flag (Z) based on the ALU result
• add, adc, sub, sbc set the one-bit Carry/Borrow flag (C) based on the ALU result
Assignment #3: CPU Architecture design (2/3)

Arithmetic and logical instructions
  add $rt, $rs     #$rt = $rt + $rs; C=ALUcarry; Z=$rt
  adc $rt, $rs     #$rt = $rt + $rs + C; C=ALUcarry; Z;
  sub $rt, $rs     #$rt = $rt - $rs; C=ALUborrow; Z;
  sbc $rt, $rs     #$rt = $rt - $rs - borrow; C; Z;
  and $rt, $rs     #$rt = $rt & $rs; C=0; Z=$rt;
  or  $rt, $rs     #$rt = $rt | $rs; C=0; Z=$rt;
  xor $rt, $rs     #$rt = $rt ^ $rs; C=1; Z=$rt;

Other instructions (continued):
  lbi $r, immed    #$r = immediate
  lbr $rt, $rs     #$rt = Mem[$rs]
  lb  $r, address  #$r = Mem[address]
  stb $r, address  #Mem[address] = $r
  bz  address      #if zero then pc=pc+2+addr
  bc  address      #if carry then pc=pc+2+addr
  j   address      #pc = address
  jr  $r           #pc = $r
Assignment #3: CPU Architecture design (3/3)
(1a) Design an RTL diagram of the model 101 processor and
(1b) Opcode formats and
(1c) Opcode bits of a Harvard Architecture CPU (determine your own field sizes).
(1d) What is the size of the Program Counter?
(2) Write the assembly code for a 32-bit add Z=X+Y, where the operands are located in memory at addresses @X = 0x80, @Y = 0x84 and @Z = 0x88
(3) Draw the state diagram for your FSM controller
(4) Write the VHDL code for the FSM controller
Note: this will be part of your final project report
Module 9: Improving Memory Access: Direct and Spatial caches
The Art of Memory System Design
[Figure: Processor → cache ($) → memory (MEM)]
Reference stream: <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . .
op: i-fetch, read, write
Optimize the memory system organization to minimize the average memory access time for typical workloads
Workload or benchmark programs
Pipelining and the cache (Designing…,M.J.Quinn, ‘87)
Instruction Pipelining is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time.
Ferranti ATLAS (1963):
• Pipelining cut the average time per instruction by a factor of about 3.75 (the “375%” figure)
• Memory could not keep up with the CPU, so a cache was needed.
Cache memory is a small, fast memory unit used as a buffer between a processor and primary memory
Principle of Locality
• The Principle of Locality states that programs access a relatively small portion of their address space at any instant of time
• Two types of locality
• Temporal locality (locality in time): if an item is referenced, then the same item will tend to be referenced again soon; “the tendency to reuse recently accessed data items”
• Spatial locality (locality in space): if an item is referenced, then nearby items will tend to be referenced soon; “the tendency to reference nearby data items”
Memory Hierarchy
[Figure: memory hierarchy, from the CPU outward]
CPU
Registers
Pipelining
Cache memory
Primary real memory
Virtual memory (Disk, swapping)
Levels nearer the CPU are faster and cost more ($$$); levels farther away are cheaper and have more capacity.
Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the cheapest technology.
• Provide access at the speed offered by the fastest technology.
[Figure: levels of the memory hierarchy, from the processor (control, datapath) outward]
Level | Speed (ns) | Size (bytes)
Processor registers | 1s | 100s
On-chip cache and second-level cache (SRAM) | 10s | Ks
Main memory (DRAM) | 100s | Ms
Secondary storage (Disk) | 10,000,000s (10s ms) | Gs
Tertiary storage (Disk) | 10,000,000,000s (10s sec) | Ts
Cache Memory Technology: SRAM 1 bit cell layout
Memories Technology and Principle of Locality
• Faster Memories are more expensive per bit
Memory technology | Typical access time | $ per Mbyte in 1997
SRAM | 5-25 ns | $100-$250
DRAM | 60-120 ns | $5-$10
Magnetic disk | 10-20 million ns | $0.10-$0.20
• Slower Memories are usually smaller in area size per bit
Cache Memory Technology: SRAM
• Why use SRAM (Static Random Access Memory)?
see reference: http://www.chips.ibm.com/products/memory/sramoperations/sramop.html
• Speed. The primary advantage of an SRAM over DRAM is speed. The fastest DRAMs on the market still require 5 to 10 processor clock cycles to access the first bit of data. SRAMs can operate at processor speeds of 250 MHz and beyond, with access and cycle times equal to the clock cycle used by the microprocessor.
• Density. When 64 Mb DRAMs are rolling off the production lines, the largest SRAMs are expected to be only 16 Mb.
Cache Memory Technology: SRAM (con’t)
• Volatility. Unlike DRAMs, SRAM cells do not need to be refreshed. SRAMs are available 100% of the time for reading and writing.
• Cost. If cost is the primary factor in a memory design, then DRAMs win hands down. If, on the other hand, performance is a critical factor, then a well-designed SRAM is an effective cost-performance solution.
Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the cheapest technology.
• Provide access at the speed offered by the fastest technology.
• DRAM is slow but cheap and dense:
• Good choice for presenting the user with a BIG memory system
• SRAM is fast but expensive and not very dense:
• Good choice for providing the user FAST access time.
Cache Terminology
A hit occurs if the data requested by the CPU is in the upper level
A miss occurs if the data is not found in the upper level
Hit rate (or hit ratio) is the fraction of accesses found in the upper level
Miss rate (= 1 – hit rate) is the fraction of accesses not found in the upper level
Hit time is the time required to access data in the upper level = <detection time for hit or miss> + <hit access time>
Miss penalty is the time required to access data in the lower level = <lower access time> + <reload processor time>
Cache Example
[Figure: data are transferred between the processor, the cache, and the lower level]
Time 1: Hit: in cache
Time 1: Miss
Time 2: fetch from lower level into cache
Time 3: deliver to CPU
Hit time = Time 1
Miss penalty = Time 2 + Time 3
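The hit time, miss rate, and miss penalty defined above combine into the average memory access time that the memory system design tries to minimize. The sketch below uses the standard relation AMAT = hit time + miss rate × miss penalty; the function name and the sample numbers are illustrative assumptions, not values from the slides.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all times in the same unit)."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers, chosen only for illustration:
# hit time (Time 1) = 1 cycle, miss rate = 5%, miss penalty (Time 2 + Time 3) = 10 cycles.
amat = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=10)
print(f"AMAT = {amat:.2f} cycles")  # 1 + 0.05 * 10 = 1.50
```

Every access pays Time 1 to detect a hit or miss; only the misses also pay Time 2 + Time 3.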
Basic Cache System
Cache Memory Technology: SRAM Block diagram
Cache Memory Technology: SRAM timing diagram
Direct Mapped Cache
[Figure: an eight-entry cache (indices 000 to 111) and memory words at addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101; each address maps to the cache entry given by its low-order three bits]
• Direct Mapped: assign the cache location based on the address of the word in memory
• cache_address = memory_address modulo cache_size;
Observe there is a Many-to-1 memory to cache relationship
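The modulo rule can be tried out directly. This is a minimal sketch, assuming an eight-entry cache of one-word blocks and 5-bit word addresses; `cache_index` and `CACHE_SIZE` are illustrative names.

```python
def cache_index(memory_address, cache_size):
    """Direct-mapped placement: cache_address = memory_address modulo cache_size."""
    return memory_address % cache_size


CACHE_SIZE = 8  # assumed: eight one-word entries, indices 000..111

for address in (0b00001, 0b01001, 0b10001, 0b11001, 0b00101, 0b11101):
    index = cache_index(address, CACHE_SIZE)
    print(f"memory {address:05b} -> cache {index:03b}")
```

Because the cache size is a power of two, the modulo simply keeps the low-order three address bits, so 00001, 01001, 10001 and 11001 all land in entry 001: the many-to-1 relationship.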
Direct Mapped Cache: Data Structure
There is a Many-to-1 relationship between memory and cache
How do we know whether the data in the cache corresponds to the requested word?
tags
• contain the address information required to identify whether a word in the cache corresponds to the requested word.
• need only contain the upper portion of the memory address (often referred to as a page address)
valid bit
• indicates whether an entry contains a valid address
Direct Mapped Cache: Temporal Example
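Instruction | Address (tag index) | Result
lw $1,22($0) | 10 110 | Miss: valid
lw $2,26($0) | 11 010 | Miss: valid
lw $3,22($0) | 10 110 | Hit!

Cache contents after the three loads (every entry starts with Valid = N):
Index | Valid | Tag | Data
000 | N | |
001 | N | |
010 | Y | 11 | Memory[11010]
011 | N | |
100 | N | |
101 | N | |
110 | Y | 10 | Memory[10110]
111 | N | |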
Direct Mapped Cache: Worst case, always miss!
Instruction | Address (tag index) | Result
lw $1,22($0) | 10 110 | Miss: valid
lw $2,30($0) | 11 110 | Miss: tag
lw $3,6($0) | 00 110 | Miss: tag

Every entry starts with Valid = N. All three addresses map to index 110, which is overwritten on each access:
110 | Y | 10 | Memory[10110]
110 | Y | 11 | Memory[11110]
110 | Y | 00 | Memory[00110]
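Both load sequences can be replayed with a small simulator. This is a sketch under stated assumptions: an eight-entry cache of one-word blocks, addresses treated as 5-bit word addresses as in the slides, and only valid bits and tags modelled. The class and function names are illustrative.

```python
class DirectMappedCache:
    """Direct-mapped cache of one-word blocks, tracking only valid bits and tags."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.valid = [False] * num_entries
        self.tag = [0] * num_entries

    def access(self, address):
        """Classify the access, then load the entry so it holds this address."""
        index = address % self.num_entries   # low-order bits
        tag = address // self.num_entries    # remaining upper bits
        if not self.valid[index]:
            result = "miss: valid"
        elif self.tag[index] != tag:
            result = "miss: tag"
        else:
            result = "hit"
        self.valid[index] = True
        self.tag[index] = tag
        return result


def run(addresses, num_entries=8):
    cache = DirectMappedCache(num_entries)
    return [cache.access(a) for a in addresses]


print(run([22, 26, 22]))  # ['miss: valid', 'miss: valid', 'hit']
print(run([22, 30, 6]))   # ['miss: valid', 'miss: tag', 'miss: tag']
```

In the worst case 22, 30 and 6 share index 110 but carry tags 10, 11 and 00, so each load evicts the previous one.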
Direct Mapped Cache: MIPS Architecture
[Figure: the 32-bit address (showing bit positions) is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0 to 1023), each holding a valid bit, a 20-bit tag and 32 bits of data. The stored tag is compared with the address tag to produce Hit, and the entry supplies Data.]
Bits in a Direct Mapped Cache
How many total bits are required for a direct mapped cache with 64 KB ($= 2^{16}$ bytes) of data and one-word (32-bit) blocks, assuming a 32-bit byte memory address?
Cache index width $= \log_2(\text{number of words}) = \log_2(2^{16}/4) = \log_2 2^{14} = 14$ bits
Block address width = <byte address width> – $\log_2$(bytes per word) $= 32 - 2 = 30$ bits
Tag size = <block address width> – <cache index width> $= 30 - 14 = 16$ bits
Cache block size = <valid size> + <tag size> + <block data size> = 1 bit + 16 bits + 32 bits = 49 bits
Total size = <number of cache blocks> × <cache block size> $= 2^{14} \times 49$ bits $= 784 \times 2^{10}$ bits = 784 Kbits = 98 KB
Overhead: 98 KB / 64 KB ≈ 1.5 times the size of the data alone
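The same arithmetic can be checked in a few lines. This sketch assumes one-word blocks, one valid bit per block, and power-of-two sizes; the function name and parameters are illustrative.

```python
def direct_mapped_total_bits(data_bytes, address_bits=32, word_bytes=4):
    """Return (total bits, index bits, tag bits) for one-word blocks.

    Assumes data_bytes and word_bytes are powers of two.
    """
    num_blocks = data_bytes // word_bytes
    index_bits = num_blocks.bit_length() - 1             # log2(number of blocks)
    block_address_bits = address_bits - (word_bytes.bit_length() - 1)
    tag_bits = block_address_bits - index_bits
    bits_per_block = 1 + tag_bits + 8 * word_bytes       # valid + tag + data
    return num_blocks * bits_per_block, index_bits, tag_bits


total_bits, index_bits, tag_bits = direct_mapped_total_bits(64 * 1024)
print(index_bits, tag_bits)          # 14 16
print(total_bits // 1024, "Kbits")   # 784 Kbits
print(total_bits / 8 / 1024, "KB")   # 98.0 KB
```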
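Split Cache: Exploiting the Harvard Architectures
Harvard architecture: the term was coined to describe machines with separate instruction and data memories. Speed efficient: increased parallelism (split cache).
[Figure: a Harvard machine with separate instruction and data memories connected to the ALU and I/O; a von Neumann machine with a single memory for instructions and data, connected to the ALU and I/O over a shared data bus and address bus]
Von Neumann architecture: area efficient, but requires higher bus bandwidth because instructions and data must compete for memory.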
Modern Systems: Pentium Pro and PowerPC
Characteristic | Intel Pentium Pro | PowerPC 604
Cache organization | Split instruction and data caches | Split instruction and data caches
Cache size | 8 KB each for instructions/data | 16 KB each for instructions/data
Cache associativity | Four-way set associative | Four-way set associative
Replacement | Approximated LRU replacement | LRU replacement
Block size | 32 bytes | 32 bytes
Write policy | Write-back | Write-back or write-through

RAM (main memory): von Neumann Architecture
Cache: uses Harvard Architecture, separate Instruction/Data caches
Cache schemes
write-through cache
Always write the data into both the cache and memory, and then wait for memory.
write-back cache
Write the data into the cache block only, and write the block to memory only when a modified block is replaced. More complex to implement in hardware.
write buffer
Write the data into the cache and the write buffer. If the write buffer is full, the processor must stall.
No amount of buffering can help if writes are being generated faster than the memory system can accept them.
Chip Area Speed
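The difference in memory traffic between write-through and write-back can be illustrated by counting the writes that reach memory. This is a sketch under stated assumptions: a direct-mapped cache of one-word blocks, no write buffer, and modified blocks still in the cache at the end are not flushed. The function name and the trace are hypothetical.

```python
def memory_writes(trace, policy, num_entries=8):
    """Count word writes that reach memory.

    trace is a list of ("r" | "w", word_address) pairs.
    policy is "write-through" or "write-back".
    """
    valid = [False] * num_entries
    dirty = [False] * num_entries
    tags = [0] * num_entries
    writes = 0
    for op, address in trace:
        index, tag = address % num_entries, address // num_entries
        if not valid[index] or tags[index] != tag:
            # Miss: a write-back cache must first write out a modified victim.
            if policy == "write-back" and valid[index] and dirty[index]:
                writes += 1
            valid[index], tags[index], dirty[index] = True, tag, False
        if op == "w":
            if policy == "write-through":
                writes += 1          # every store also goes to memory
            else:
                dirty[index] = True  # defer until the block is replaced
    return writes


# Hypothetical trace: ten stores to word 5, then a load of word 13 (same index).
trace = [("w", 5)] * 10 + [("r", 13)]
print(memory_writes(trace, "write-through"))  # 10
print(memory_writes(trace, "write-back"))     # 1
```

Write-back trades the dirty-bit bookkeeping for fewer memory writes when the same block is stored to repeatedly.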
Hits vs. Misses
• Read hits
  • this is what we want!
• Read misses
  • stall the CPU, fetch block from memory, deliver to cache, and restart.
• Write hits
  • write-through: can replace data in cache and memory.
  • write-buffer: write data into cache and buffer.
  • write-back: write the data only into the cache.
• Write misses
  • read the entire block into the cache, then write the word.
Example: The DECStation 3100 cache
DECStation uses a write-through Harvard architecture cache
• 128 KB total cache size (= 32K words)
  • = 64 KB instruction cache (= 16K words)
  • + 64 KB data cache (= 16K words)
• 10 processor clock cycles to write to memory
The DECStation 3100 miss rates
• A split instruction and data cache increases the bandwidth
Benchmark program | Instruction miss rate | Data miss rate | Effective split miss rate | Combined miss rate
gcc | 6.1% | 2.1% | 5.4% | 4.8%
spice | 1.2% | 1.3% | 1.2% | (not shown)

gcc: the split cache has a slightly worse miss rate than the combined cache.
spice: why a lower miss rate? Numerical programs tend to consist of a lot of small program loops.
A 1.2% miss rate also means that 98.8% of the time the data is in the cache. So using a cache pays off!
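One way to read the effective split miss rate is as an average of the instruction and data miss rates, weighted by how often each cache is referenced. The slides do not state this definition or the reference mix, so both the formula and the fraction derived below are assumptions made for illustration.

```python
def effective_split_miss_rate(instr_miss, data_miss, instr_fraction):
    """Reference-weighted miss rate of a split cache (assumed definition)."""
    return instr_fraction * instr_miss + (1.0 - instr_fraction) * data_miss


def implied_instruction_fraction(instr_miss, data_miss, effective):
    """Fraction of references that would be instruction fetches, given the three rates."""
    return (effective - data_miss) / (instr_miss - data_miss)


# gcc figures from the table, in percent.
fraction = implied_instruction_fraction(6.1, 2.1, 5.4)
print(round(fraction, 3))                                        # 0.825
print(round(effective_split_miss_rate(6.1, 2.1, fraction), 1))   # 5.4
```

Under this reading, gcc's effective rate sits close to its instruction miss rate because most references would be instruction fetches.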
Review: Principle of Locality
• The Principle of Locality states that programs access a relatively small portion of their address space at any instant of time
• Two types of locality
• Temporal locality (locality in time): if an item is referenced, then the same item will tend to be referenced again soon; “the tendency to reuse recently accessed data items”
• Spatial locality (locality in space): if an item is referenced, then nearby items will tend to be referenced soon; “the tendency to reference nearby data items”
Spatial Locality
• Temporal only cache: the cache block contains only one word (no spatial locality).
• Spatial locality: the cache block contains multiple words.
• When a miss occurs, fetch multiple words.
• Advantage: hit ratio increases because there is a high probability that the adjacent words will be needed shortly.
• Disadvantage: miss penalty increases with block size.
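The advantage of multi-word blocks shows up clearly on a sequential scan. The sketch below compares two direct-mapped caches that both hold 16 words, one with one-word blocks and one with four-word blocks. The trace and the cache sizes are hypothetical, and the longer miss penalty of the larger block is not modelled.

```python
def miss_rate(word_addresses, num_blocks, words_per_block):
    """Miss rate of a direct-mapped cache with the given block size."""
    valid = [False] * num_blocks
    tags = [0] * num_blocks
    misses = 0
    for address in word_addresses:
        block_address = address // words_per_block
        index = block_address % num_blocks
        tag = block_address // num_blocks
        if not valid[index] or tags[index] != tag:
            misses += 1  # a miss fetches the whole block
            valid[index], tags[index] = True, tag
    return misses / len(word_addresses)


sequential = list(range(64))  # hypothetical trace: 64 consecutive words
print(miss_rate(sequential, num_blocks=16, words_per_block=1))  # 1.0
print(miss_rate(sequential, num_blocks=4, words_per_block=4))   # 0.25
```

With four-word blocks, each miss brings in the next three words, so only one access in four misses on this trace.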
Spatial Locality: 64 KB cache, 4 words
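• 64 KB cache using four-word (16-byte) blocks
• 16 bit tag, 12 bit index, 2 bit block offset, 2 bit byte offset.
[Figure: the 32-bit address (showing bit positions) is split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) and a 2-bit byte offset (bits 1-0). The index selects one of 4K entries, each holding a valid bit (V), a 16-bit tag and 128 bits of data. The tag comparison produces Hit, and a mux driven by the block offset selects one of the four 32-bit words as Data.]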
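The field boundaries of this cache can be expressed as shifts and masks. This is a minimal sketch for the 16/12/2/2 split only; the function name and the sample address are illustrative.

```python
def split_address(address):
    """Split a 32-bit byte address into (tag, index, block_offset, byte_offset)."""
    byte_offset = address & 0x3            # bits 1-0
    block_offset = (address >> 2) & 0x3    # bits 3-2, selects one of four words
    index = (address >> 4) & 0xFFF         # bits 15-4, selects one of 4K entries
    tag = (address >> 16) & 0xFFFF         # bits 31-16
    return tag, index, block_offset, byte_offset


tag, index, block_offset, byte_offset = split_address(0x1234ABCD)
print(hex(tag), hex(index), block_offset, byte_offset)  # 0x1234 0xabc 3 1
```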
Performance
• Use split caches because there is more spatial locality in code:

Program | Block size | Instruction miss rate | Data miss rate | Effective split miss rate | Combined miss rate
gcc | 1 | 6.1% | 2.1% | 5.4% | 4.8%
gcc | 4 | 2.0% | 1.7% | 1.9% | 4.8%
spice | 1 | 1.2% | 1.3% | 1.2% | (not shown)
spice | 4 | 0.3% | 0.6% | 0.4% | (not shown)

Temporal only split cache: has slightly worse miss rate
Spatial split cache: has lower miss rate
Cache Block size Performance
• Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0% to 40%) versus block size (4, 16, 64, 256 bytes), with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB]